TL;DR: MLlib is an open-source distributed machine learning library for Apache Spark that provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.

Abstract: Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.

1,551 citations

Journal Article•DOI•

Social big data

Gema Bello-Orgaz¹, Jason J. Jung², David Camacho¹•Institutions (2)

Autonomous University of Madrid¹, Chung-Ang University²

01 Mar 2016-Information Fusion

TL;DR: This paper presents a revision of the new methodologies that are designed to allow for efficient data mining and information fusion from social media and of the new applications and frameworks that are currently appearing under the “umbrella” of the social networks, social media and big data paradigms.

681 citations

Proceedings Article•DOI•

Simba: Efficient In-Memory Spatial Analytics

Dong Xie¹, Feifei Li¹, Bin Yao², Gefei Li², Liang Zhou², Minyi Guo²•Institutions (2)

University of Utah¹, Shanghai Jiao Tong University²

14 Jun 2016

TL;DR: Simba is a scalable and efficient in-memory spatial query processing and analytics system for big spatial data that extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API.

Abstract: Large spatial data has become ubiquitous. As a result, it is critical to provide fast, scalable, and high-throughput spatial queries and analytics for numerous applications in location-based services (LBS). Traditional spatial databases and spatial analytics systems are disk-based and optimized for IO efficiency. But increasingly, data are stored and processed in memory to achieve low latency, and CPU time becomes the new bottleneck. We present the Simba (Spatial In-Memory Big data Analytics) system that offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba is based on Spark and runs over a cluster of commodity machines. In particular, Simba extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API. It introduces indexes over RDDs in order to work with big spatial data and complex spatial operations. Lastly, Simba implements an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput. Extensive experiments over large data sets demonstrate Simba's superior performance compared against other spatial analytics systems.

228 citations

Journal Article•DOI•

SystemML: declarative machine learning on spark

Matthias Boehm¹, Michael W. Dusenberry¹, Deron Eriksson¹, Alexandre V. Evfimievski¹, Faraz Makari Manshadi¹, Niketan Pansare¹, Berthold Reinwald¹, Frederick Reiss¹, Prithviraj Sen¹, Arvind C. Surve¹, Shirish Tatikonda¹•Institutions (1)

IBM¹

01 Sep 2016

TL;DR: This paper describes SystemML on Apache Spark, end to end, including insights into various optimizer and runtime techniques as well as performance characteristics.

Abstract: The rising need for custom machine learning (ML) algorithms and the growing data sizes that require the exploitation of distributed, data-parallel frameworks such as MapReduce or Spark, pose significant productivity challenges to data scientists. Apache SystemML addresses these challenges through declarative ML by (1) increasing the productivity of data scientists as they are able to express custom algorithms in a familiar domain-specific language covering linear algebra primitives and statistical functions, and (2) transparently running these ML algorithms on distributed, data-parallel frameworks by applying cost-based compilation techniques to generate efficient, low-level execution plans with in-memory single-node and large-scale distributed operations. This paper describes SystemML on Apache Spark, end to end, including insights into various optimizer and runtime techniques as well as performance characteristics. We also share lessons learned from porting SystemML to Spark and declarative ML in general. Finally, SystemML is open-source, which allows the database community to leverage it as a testbed for further research.

195 citations

Journal Article•DOI•

LocationSpark: a distributed in-memory data management system for big spatial data

Mingjie Tang¹, Yongyang Yu¹, Qutaibah M. Malluhi², Mourad Ouzzani³, Walid G. Aref¹•Institutions (3)

Purdue University¹, Qatar University², Qatar Computing Research Institute³

01 Sep 2016

TL;DR: This work builds two new layers over Spark, namely a query scheduler and a query executor, and embeds an efficient spatial Bloom filter into LocationSpark's indexes to avoid unnecessary network communication overhead when processing overlapped spatial data.

Abstract: We present LocationSpark, a spatial data processing system built on top of Apache Spark, a widely used distributed data processing system. LocationSpark offers a rich set of spatial query operators, e.g., range search, kNN, spatio-textual operation, spatial-join, and kNN-join. To achieve high performance, LocationSpark employs various spatial indexes for in-memory data, and guarantees that immutable spatial indexes have low overhead with fault tolerance. In addition, we build two new layers over Spark, namely a query scheduler and a query executor. The query scheduler is responsible for mitigating skew in spatial queries, while the query executor selects the best plan based on the indexes and the nature of the spatial queries. Furthermore, to avoid unnecessary network communication overhead when processing overlapped spatial data, we embed an efficient spatial Bloom filter into LocationSpark's indexes. Finally, LocationSpark tracks frequently accessed spatial data, and dynamically flushes less frequently accessed data to disk. We evaluate our system on real workloads and demonstrate that it achieves an order of magnitude performance gain over a baseline framework.

150 citations

Proceedings Article•DOI•

Big Data Analytics with Datalog Queries on Spark

Alexander Shkapsky¹, Mohan Yang¹, Matteo Interlandi¹, Hsuan Chiu¹, Tyson Condie¹, Carlo Zaniolo¹•Institutions (1)

University of California, Los Angeles¹

14 Jun 2016

TL;DR: This work proposes compilation and optimization techniques that tackle the important problem of efficiently supporting recursion in Spark, and performs an experimental comparison with other state-of-the-art large-scale Datalog systems to verify the efficacy of these techniques and the effectiveness of Spark in supporting Datalog-based analytics.

Abstract: There is great interest in exploiting the opportunity provided by cloud computing platforms for large-scale analytics. Among these platforms, Apache Spark is growing in popularity for machine learning and graph analytics. Developing efficient complex analytics in Spark requires deep understanding of both the algorithm at hand and the Spark API or subsystem APIs (e.g., Spark SQL, GraphX). Our BigDatalog system addresses the problem by providing concise declarative specification of complex queries amenable to efficient evaluation. Towards this goal, we propose compilation and optimization techniques that tackle the important problem of efficiently supporting recursion in Spark. We perform an experimental comparison with other state-of-the-art large-scale Datalog systems and verify the efficacy of our techniques and effectiveness of Spark in supporting Datalog-based analytics.
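
To make the recursion problem concrete, here is a minimal single-machine sketch of semi-naive evaluation for the classic transitive-closure Datalog program. Semi-naive evaluation is the standard textbook technique for recursive Datalog; it is used here only as an illustration and is not claimed to be the compilation or optimization strategy of BigDatalog. The function name `transitive_closure` and the in-memory set representation are assumptions made for this example.

```python
def transitive_closure(edges):
    """Semi-naive evaluation of the Datalog program:

        tc(X, Y) <- edge(X, Y).
        tc(X, Y) <- tc(X, Z), edge(Z, Y).

    `edges` is an iterable of (source, destination) pairs.
    """
    # Index edges by source so the join tc(X, Z) with edge(Z, Y) is a lookup.
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    closure = set(edges)  # every fact derived so far
    delta = set(edges)    # facts that were new in the previous iteration
    while delta:
        # Join only the *new* facts with edge, not the whole closure.
        derived = {
            (x, y)
            for (x, z) in delta
            for y in successors.get(z, ())
        }
        delta = derived - closure  # keep only genuinely new facts
        closure |= delta
    return closure


if __name__ == "__main__":
    chain = {("a", "b"), ("b", "c"), ("c", "d")}
    print(sorted(transitive_closure(chain)))
```

Each iteration joins only the facts discovered in the previous round, so the loop stops as soon as a round produces nothing new. In a distributed setting every such round would be a join over partitioned data, which is why efficient recursion support matters.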

116 citations

Journal Article•DOI•

A Framework for Fast and Efficient Cyber Security Network Intrusion Detection Using Apache Spark

Govind P. Gupta¹, Manish Kulariya¹•Institutions (1)

National Institute of Technology, Raipur¹

01 Jan 2016-Procedia Computer Science

TL;DR: This paper proposes a framework in which a well-known feature selection algorithm is first employed to select the most important features, and a classification-based intrusion detection method is then used for fast and efficient detection of intrusions in massive network traffic.

95 citations

Proceedings Article•DOI•

MapReduce program synthesis

Calvin Smith¹, Aws Albarghouthi¹•Institutions (1)

University of Wisconsin-Madison¹

02 Jun 2016

TL;DR: This paper presents a new algorithm and tool for synthesizing programs composed of efficient data-parallel operations that can execute on cloud computing infrastructure and demonstrates the efficiency of the approach and the small number of examples it requires to synthesize correct, scalable programs.

Abstract: By abstracting away the complexity of distributed systems, large-scale data processing platforms—MapReduce, Hadoop, Spark, Dryad, etc.—have provided developers with simple means for harnessing the power of the cloud. In this paper, we ask whether we can automatically synthesize MapReduce-style distributed programs from input–output examples. Our ultimate goal is to enable end users to specify large-scale data analyses through the simple interface of examples. We thus present a new algorithm and tool for synthesizing programs composed of efficient data-parallel operations that can execute on cloud computing infrastructure. We evaluate our tool on a range of real-world big-data analysis tasks and general computations. Our results demonstrate the efficiency of our approach and the small number of examples it requires to synthesize correct, scalable programs.
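
As a toy illustration of what synthesizing a MapReduce-style program from input-output examples can look like, the sketch below does a brute-force search over a tiny, hand-picked component library. This naive enumeration is an assumption made for illustration only and is not the algorithm presented in the paper; the names `MAPPERS`, `REDUCERS`, `run` and `synthesize` are hypothetical.

```python
from functools import reduce
from itertools import product

# Hypothetical component library: element-wise mappers and binary reducers.
MAPPERS = {
    "identity": lambda x: x,
    "square": lambda x: x * x,
    "one": lambda x: 1,
}
REDUCERS = {
    "add": lambda a, b: a + b,
    "max": lambda a, b: a if a >= b else b,
    "min": lambda a, b: a if a <= b else b,
}


def run(mapper_name, reducer_name, data):
    """Apply the named mapper to every element, then fold with the named reducer.

    `data` must be a non-empty sequence.
    """
    return reduce(REDUCERS[reducer_name], map(MAPPERS[mapper_name], data))


def synthesize(examples):
    """Return the first (mapper, reducer) pair consistent with all examples, or None.

    `examples` is a list of (input_list, expected_output) pairs.
    """
    for mapper_name, reducer_name in product(MAPPERS, REDUCERS):
        if all(run(mapper_name, reducer_name, inp) == out for inp, out in examples):
            return mapper_name, reducer_name
    return None


if __name__ == "__main__":
    # Two examples are enough to pin down "count the elements" in this library.
    print(synthesize([([1, 2, 3], 3), ([5, 5], 2)]))
```

A single example is often ambiguous: `[1, 2, 3] -> 3` alone is also explained by taking the maximum, which is why the usage above supplies a second example.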

89 citations

Proceedings Article•DOI•

Spark Versus Flink: Understanding Performance in Big Data Analytics Frameworks

Ovidiu-Cristian Marcu, Alexandru Costan¹, Gabriel Antoniu, María S. Pérez-Hernández²•Institutions (2)

Institut national des sciences appliquées¹, Technical University of Madrid²

12 Sep 2016

TL;DR: A fine characterization of the cases when each framework is superior is performed, and how this performance correlates to operators, to resource usage and to the specifics of the internal framework design is highlighted.

Abstract: Big Data analytics has recently gained increasing popularity as a tool to process large amounts of data on-demand. Spark and Flink are two Apache-hosted data analytics frameworks that facilitate the development of multi-step data pipelines using directed acyclic graph patterns. Making the most out of these frameworks is challenging because efficient executions strongly rely on complex parameter configurations and on an in-depth understanding of the underlying architectural choices. Although extensive research has been devoted to improving and evaluating the performance of such analytics frameworks, most of it benchmarks the platforms against Hadoop as a baseline, a rather unfair comparison considering the fundamentally different design principles. This paper aims to bring some justice in this respect by directly evaluating the performance of Spark and Flink. Our goal is to identify and explain the impact of the different architectural choices and the parameter configurations on the perceived end-to-end performance. To this end, we develop a methodology for correlating the parameter settings and the operators execution plan with the resource usage. We use this methodology to dissect the performance of Spark and Flink with several representative batch and iterative workloads on up to 100 nodes. Our key finding is that neither of the two frameworks outperforms the other for all data types, sizes and job patterns. This paper performs a fine characterization of the cases when each framework is superior, and we highlight how this performance correlates to operators, to resource usage and to the specifics of the internal framework design.

88 citations

Book•

An Architecture for Fast and General Data Processing on Large Clusters

Matei Zaharia¹•Institutions (1)

Massachusetts Institute of Technology¹

01 May 2016

TL;DR: This book proposes an architecture for cluster computing systems that can tackle emerging data processing workloads at scale, and proposes a simple extension to MapReduce that adds primitives for data sharing, called Resilient Distributed Datasets (RDDs), which is implemented in the open source Spark system.

Abstract: Today, myriad data sources, from the Internet to business operations to scientific instruments, produce large and valuable data streams. However, the processing capabilities of single machines have not kept up with the size of data. As a result, organizations increasingly need to scale out these computations to clusters of hundreds of machines. At the same time, the speed and sophistication required of data processing have grown. In addition to simple queries, complex algorithms like machine learning and graph analysis are becoming common. And in addition to batch processing, streaming analysis of real-time data is required to let organizations take timely action. Future computing platforms will need to not only scale out traditional workloads, but support these new applications too. This book, a revised version of the 2014 ACM Dissertation Award winning dissertation, proposes an architecture for cluster computing systems that can tackle emerging data processing workloads at scale. Whereas early cluster computing systems, like MapReduce, handled batch processing, our architecture also enables streaming and interactive queries, while keeping MapReduce's scalability and fault tolerance. And whereas most deployed systems only support simple one-pass computations (e.g., SQL queries), ours also extends to the multi-pass algorithms required for complex analytics like machine learning. Finally, unlike the specialized systems proposed for some of these workloads, our architecture allows these computations to be combined, enabling rich new applications that intermix, for example, streaming and batch processing. We achieve these results through a simple extension to MapReduce that adds primitives for data sharing, called Resilient Distributed Datasets (RDDs). We show that this is enough to capture a wide range of workloads. We implement RDDs in the open source Spark system, which we evaluate using synthetic and real workloads. 
Spark matches or exceeds the performance of specialized systems in many domains, while offering stronger fault tolerance properties and allowing these workloads to be combined. Finally, we examine the generality of RDDs from both a theoretical modeling perspective and a systems perspective. This version of the dissertation makes corrections throughout the text and adds a new section on the evolution of Apache Spark in industry since 2014. In addition, editing, formatting, drawing of illustrations, and links for the references have been added.

Proceedings Article•DOI•

Matrix Computations and Optimization in Apache Spark

Reza Bosagh Zadeh¹, Xiangrui Meng, Alexander Ulanov², Burak Yavuz, Li Pu³, Shivaram Venkataraman⁴, Evan R. Sparks⁴, Aaron Staple, Matei Zaharia⁵•Institutions (5)

Stanford University¹, Hewlett-Packard², Twitter³, University of California, Berkeley⁴, Massachusetts Institute of Technology⁵

13 Aug 2016

TL;DR: This paper describes the matrix computations available in the cluster programming framework Apache Spark, including a Singular Value Decomposition that exploits the computational power of a cluster while running code written decades ago for a single core.

Abstract: We describe matrix computations available in the cluster programming framework, Apache Spark. Out of the box, Spark provides abstractions and implementations for distributed matrices and optimization routines using these matrices. When translating single-node algorithms to run on a distributed cluster, we observe that often a simple idea is enough: separating matrix operations from vector operations and shipping the matrix operations to be run on the cluster, while keeping vector operations local to the driver. In the case of the Singular Value Decomposition, by taking this idea to an extreme, we are able to exploit the computational power of a cluster, while running code written decades ago for a single core. Another example is our Spark port of the popular TFOCS optimization package, originally built for MATLAB, which allows for solving linear programs as well as a variety of other convex programs. We conclude with a comprehensive set of benchmarks for hardware accelerated matrix computations from the JVM, which is interesting in its own right, as many cluster programming frameworks use the JVM. The contributions described in this paper are already merged into Apache Spark and available on Spark installations by default, and commercially supported by a slew of companies which provide further services.
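
The split between matrix operations and vector operations can be sketched in plain Python, with lists of rows standing in for cluster partitions. This is a simulation under stated assumptions, not Spark code: power iteration is used as a simple stand-in for the single-core eigensolver the abstract alludes to, and the names `partition_gramian_times_vector` and `top_singular_value` are hypothetical.

```python
import math


def partition_gramian_times_vector(rows, v):
    """'Cluster side': for one partition of rows a_i, return sum_i (a_i . v) * a_i.

    Summing this over all partitions gives (A^T A) v without ever forming A^T A.
    """
    out = [0.0] * len(v)
    for row in rows:
        dot = sum(a * b for a, b in zip(row, v))
        for j, a in enumerate(row):
            out[j] += dot * a
    return out


def top_singular_value(partitions, n_cols, iterations=100):
    """'Driver side': power iteration on A^T A using only small local vectors.

    Assumes the uniform start vector is not orthogonal to the top right
    singular vector of A.
    """
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    eigenvalue = 0.0
    for _ in range(iterations):
        # Matrix work: one partial result per partition (a map step on a cluster).
        partials = [partition_gramian_times_vector(p, v) for p in partitions]
        # Combine the partial vectors (a reduce step on a cluster).
        w = [sum(column) for column in zip(*partials)]
        # Vector work: normalization stays local to the driver.
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
        eigenvalue = norm  # ||A^T A v|| with ||v|| = 1 approaches sigma_max^2
    return math.sqrt(eigenvalue)


if __name__ == "__main__":
    # A 2x2 diagonal matrix split into two single-row partitions.
    partitions = [[[3.0, 0.0]], [[0.0, 1.0]]]
    print(top_singular_value(partitions, n_cols=2))
```

Only vectors of length `n_cols` ever reach the driver, which is what makes this pattern a natural fit for tall-and-skinny matrices.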

Proceedings Article•DOI•

A Novel Method for Tuning Configuration Parameters of Spark Based on Machine Learning

Guolu Wang, Jungang Xu¹, Ben He•Institutions (1)

Chinese Academy of Sciences¹

01 Dec 2016

TL;DR: A novel method for tuning configuration of Spark based on machine learning is proposed, which is composed of binary classification and multi-classification and can be used to auto-tune the configuration parameters of Spark.

Abstract: Apache Spark is an open source distributed data processing platform, which can use distributed memory abstraction to process large volumes of data efficiently. As Apache Spark is applied more and more widely, some problems have been exposed. One of the most important is performance. Apache Spark has more than 180 configuration parameters, which can be adjusted by users according to their own specific application so as to optimize the performance. Currently these parameters are tuned manually by trial and error, which is ineffective due to the large parameter space and the complex interactions among the parameters. In this paper, in order to make the parameter tuning process of Spark more effective, a novel method for tuning configuration of Spark based on machine learning is proposed, which is composed of binary classification and multi-classification. This method can be used to auto-tune the configuration parameters of Spark. Furthermore, several common machine learning algorithms based on the proposed method are explored, and experimental results show that the decision tree model (C5.0) is the best model considering accuracy and computational efficiency. Finally, the experimental results also show that performance improves by 36% on average with the proposed method compared with the default configuration of Spark.

Proceedings Article•DOI•

BigDebug: debugging primitives for interactive big data processing in spark

Muhammad Ali Gulzar¹, Matteo Interlandi¹, Seunghyun Yoo¹, Sai Deep Tetali¹, Tyson Condie¹, Todd Millstein¹, Miryung Kim¹•Institutions (1)

University of California, Los Angeles¹

14 May 2016

TL;DR: BigDebug designs a set of interactive, real-time debugging primitives for big data processing in Apache Spark, the next generation data-intensive scalable cloud computing platform and shows that BigDebug supports debugging at interactive speeds with minimal performance impact.

Abstract: Developers use cloud computing platforms to process a large quantity of data in parallel when developing big data analytics. Debugging the massive parallel computations that run in today’s data-centers is time consuming and error-prone. To address this challenge, we design a set of interactive, real-time debugging primitives for big data processing in Apache Spark, the next generation data-intensive scalable cloud computing platform. This requires re-thinking the notion of step-through debugging in a traditional debugger such as gdb, because pausing the entire computation across distributed worker nodes causes significant delay and naively inspecting millions of records using a watchpoint is too time consuming for an end user. First, BigDebug’s simulated breakpoints and on-demand watchpoints allow users to selectively examine distributed, intermediate data on the cloud with little overhead. Second, a user can also pinpoint a crash-inducing record and selectively resume relevant sub-computations after a quick fix. Third, a user can determine the root causes of errors (or delays) at the level of individual records through a fine-grained data provenance capability. Our evaluation shows that BigDebug scales to terabytes and its record-level tracing incurs less than 25% overhead on average. It determines crash culprits orders of magnitude more accurately and provides up to 100% time saving compared to the baseline replay debugger. The results show that BigDebug supports debugging at interactive speeds with minimal performance impact.

Proceedings Article•DOI•

A Performance Comparison of Open-Source Stream Processing Platforms

Martin Andreoni Lopez¹, Antonio Gonzalez Pastana Lobato¹, Otto Carlos M. B. Duarte¹•Institutions (1)

Federal University of Rio de Janeiro¹

01 Dec 2016

TL;DR: Results show that the performance of native stream processing systems, Storm and Flink, is up to 15 times higher than the micro-batch processing system, Spark Streaming, and Spark Streaming is more robust to node failures and provides recovery without losses.

Abstract: Distributed stream processing platforms are a new class of real-time monitoring systems that analyze and extract knowledge from large continuous streams of data. This type of system is crucial for providing the high throughput and low latency required by Big Data or Internet of Things monitoring applications. This paper describes and analyzes three main open-source distributed stream-processing platforms: Storm, Flink, and Spark Streaming. We analyze the system architectures and we compare their main features. We carry out two experiments concerning anomaly detection on network traffic to evaluate the throughput efficiency and the resilience to node failures. Results show that the performance of native stream processing systems, Storm and Flink, is up to 15 times higher than the micro-batch processing system, Spark Streaming. On the other hand, Spark Streaming is more robust to node failures and provides recovery without losses.

Proceedings Article•DOI•

SparkR: Scaling R Programs with Spark

Shivaram Venkataraman¹, Zongheng Yang¹, Davies Liu, Eric Liang, Hossein Falaki, Xiangrui Meng, Reynold Xin, Ali Ghodsi, Michael J. Franklin¹, Ion Stoica¹, Matei Zaharia²•Institutions (2)

University of California, Berkeley¹, Massachusetts Institute of Technology²

14 Jun 2016

TL;DR: This paper presents SparkR, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large scale data analysis from the R shell.

Abstract: R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the R runtime is single threaded and can only process data sets that fit in a single machine's memory. We present SparkR, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large scale data analysis from the R shell. We describe the main design goals of SparkR, discuss how the high-level DataFrame API enables scalable computation and present some of the key details of our implementation.

Journal Article•DOI•

Smart Meter Data Analytics: Systems, Algorithms, and Benchmarking

Xiufeng Liu¹, Lukasz Golab², Wojciech Golab², Ihab F. Ilyas², Shichao Jin²•Institutions (2)

Technical University of Denmark¹, University of Waterloo²

21 Nov 2016-ACM Transactions on Database Systems

TL;DR: This article designs a performance benchmark that includes common smart meter analytics tasks as well as a framework for online anomaly detection, presents an algorithm for generating large realistic datasets from a small seed of real data, and implements the benchmark on five representative platforms.

Abstract: Smart electricity meters have been replacing conventional meters worldwide, enabling automated collection of fine-grained (e.g., every 15 minutes or hourly) consumption data. A variety of smart meter analytics algorithms and applications have been proposed, mainly in the smart grid literature. However, the focus has been on what can be done with the data rather than how to do it efficiently. In this article, we examine smart meter analytics from a software performance perspective. First, we design a performance benchmark that includes common smart meter analytics tasks. These include offline feature extraction and model building as well as a framework for online anomaly detection that we propose. Second, since obtaining real smart meter data is difficult due to privacy issues, we present an algorithm for generating large realistic datasets from a small seed of real data. Third, we implement the proposed benchmark using five representative platforms: a traditional numeric computing platform (Matlab), a relational DBMS with a built-in machine learning toolkit (PostgreSQL/MADlib), a main-memory column store (“System C”), and two distributed data processing platforms (Hive and Spark/Spark Streaming). We compare the five platforms in terms of application development effort and performance on a multicore machine as well as a cluster of 16 commodity servers.

Proceedings Article•DOI•

Matrix factorizations at scale: A comparison of scientific data analytics in spark and C+MPI using three case studies

Alex Gittens¹, Aditya Devarakonda¹, Evan Racah², Michael F. Ringenburg³, L. Gerhardt², Jey Kottalam¹, Jialin Liu², Kristyn Maschhoff³, Shane Canon², Jatin Chhugani, Pramod Sharma³, Jiyan Yang⁴, James Demmel¹, Jim Harrell³, Venkat Krishnamurthy³, Michael W. Mahoney¹, Prabhat²•Institutions (4)

University of California, Berkeley¹, Lawrence Berkeley National Laboratory², Cray³, Stanford University⁴

12 May 2016

TL;DR: In this article, the authors explore the trade-offs of performing linear algebra using Apache Spark compared to traditional C and MPI implementations on HPC platforms, and apply these methods to 1.6TB particle physics, 2.2TB and 16TB climate modeling and 1.1TB bioimaging data.

Abstract: We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely-used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). We apply these methods to 1.6TB particle physics, 2.2TB and 16TB climate modeling and 1.1TB bioimaging data. The data matrices are tall-and-skinny, which enables the algorithms to map conveniently into Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance.

Proceedings Article•DOI•

High-performance design of apache spark with RDMA and its benefits on various workloads

Xiaoyi Lu¹, Dipti Shankar¹, Shashank Gugnani¹, Dhabaleswar K. Panda¹•Institutions (1)

Ohio State University¹

01 Dec 2016

TL;DR: The RDMA-based Spark design is implemented as a pluggable module and it does not change any Spark APIs, which means that it can be combined with other existing enhanced designs for Apache Spark and Hadoop proposed in the community.

Abstract: The in-memory data processing framework, Apache Spark, has been stealing the limelight for low-latency interactive applications, iterative and batch computations. Our early experience study [17] has shown that Apache Spark can be enhanced to leverage advanced features (e.g., RDMA) on high-performance networks (e.g., InfiniBand and RoCE) to improve the performance of the shuffle phase. With the fast evolution of the Apache Spark ecosystem, the Spark architecture has been changing a lot. This motivates us to investigate whether the earlier RDMA design can be adapted and further enhanced for the new Apache Spark architecture. We also aim to improve the performance for various Spark workloads (e.g., Batch, Graph, SQL). In this paper, we present a detailed design for high-performance RDMA-based Apache Spark on high-performance networks. We conduct systematic performance evaluations on three modern clusters (Chameleon, SDSC Comet, and an in-house cluster) with cutting-edge InfiniBand technologies, such as the latest IB EDR (100 Gbps) network, the recently introduced Single Root I/O Virtualization (SR-IOV) technology for IB, etc. The evaluation results show that compared to the default Spark running with IP over InfiniBand (IPoIB), our proposed design can achieve up to 79% performance improvement for Spark RDD operation benchmarks (e.g., GroupBy, SortBy), up to 38% performance improvement for batch workloads (e.g., Sort and TeraSort in Intel HiBench), up to 46% performance improvement for graph processing workloads (e.g., PageRank), and up to 32% performance improvement for SQL queries (e.g., Aggregation, Join) on varied scales (up to 1,536 cores) of bare-metal IB clusters. Performance evaluations on SR-IOV enabled IB clusters also show 37% improvement achieved by our RDMA-based design. 
Our RDMA-based Spark design is implemented as a pluggable module and it does not change any Spark APIs, which means that it can be combined with other existing enhanced designs for Apache Spark and Hadoop proposed in the community. To show this, we further evaluate the performance of a combined version of ‘RDMA-Spark+RDMA-HDFS’ and the numbers show that the combination can achieve the best performance with up to 82% improvement for Intel HiBench Sort and TeraSort on SDSC Comet cluster.

Proceedings Article•DOI•

Performance evaluation of big data frameworks for large-scale data analytics

Jorge Veiga, Roberto R. Expósito, Xoán C. Pardo, Guillermo L. Taboada, Juan Touriño

01 Dec 2016

TL;DR: Analysis of the results has shown that replacing Hadoop with Spark or Flink can lead to a reduction in execution times by 77% and 70% on average, respectively, for non-sort benchmarks.

Abstract: The increasing adoption of Big Data analytics has led to a high demand for efficient technologies in order to manage and process large datasets. Popular MapReduce frameworks such as Hadoop are being replaced by emerging ones like Spark or Flink, which improve both the programming APIs and performance. However, few works have focused on comparing these frameworks. This paper addresses this issue by performing a comparative evaluation of Hadoop, Spark and Flink using representative Big Data workloads and considering factors like performance and scalability. Moreover, the behavior of these frameworks has been characterized by modifying some of the main parameters of the workloads such as HDFS block size, input data size, interconnect network or thread configuration. The analysis of the results has shown that replacing Hadoop with Spark or Flink can lead to a reduction in execution times by 77% and 70% on average, respectively, for non-sort benchmarks.
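
For readers who think in speedup factors rather than percentage reductions, the conversion is speedup = 1 / (1 - reduction). The helper below is a small sketch; the function name `speedup_from_time_reduction` is hypothetical. It treats the reported average reduction as if it applied to a single run, which is a simplifying assumption because an average of per-benchmark reductions does not translate exactly into an average speedup.

```python
def speedup_from_time_reduction(reduction_percent):
    """Convert a percentage reduction in execution time into a speedup factor.

    A reduction of r percent means the new time is (1 - r/100) of the old time,
    so the speedup is old_time / new_time = 1 / (1 - r/100).
    """
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction_percent must be in the range [0, 100)")
    return 1.0 / (1.0 - reduction_percent / 100.0)


if __name__ == "__main__":
    # Figures quoted in the abstract for non-sort benchmarks.
    for framework, reduction in [("Spark", 77), ("Flink", 70)]:
        factor = speedup_from_time_reduction(reduction)
        print(f"{framework}: {reduction}% less time is roughly {factor:.2f}x faster")
```

Under that assumption, the quoted 77% and 70% reductions correspond to roughly 4.35x and 3.33x speedups over Hadoop.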

Journal Article•DOI•

Three-dimensional spray–flow interaction in a spark-ignition direct-injection engine

Hao Chen¹, Peter M. Lillo¹, Volker Sick¹•Institutions (1)

University of Michigan¹

01 Jan 2016-International Journal of Engine Research

TL;DR: This article notes that large efforts are currently being made toward improving internal combustion engine efficiency without degrading overall performance, and it considers advanced combustion strategies to that end.

Abstract: Large efforts are currently being made toward improving internal combustion engine efficiency without degrading overall performance. To this end, advanced combustion strategies that require in-cyli...

Journal Article•DOI•

Semi-tensor compressed sensing

Dong Xie¹, Haipeng Peng¹, Lixiang Li¹, Yixian Yang¹•Institutions (1)

Beijing University of Posts and Telecommunications¹

01 Nov 2016-Digital Signal Processing

TL;DR: A new model for signal compression and reconstruction based on the semi-tensor product, called STP-CS, is proposed; it generalizes traditional CS and offers the flexibility to choose a lower-dimensional sensing matrix for signal compression and reconstruction.

Proceedings Article•DOI•

Workload characterization and optimization of TPC-H queries on Apache Spark

Tatsuhiro Chiba¹, Tamiya Onodera¹•Institutions (1)

IBM¹

17 Apr 2016

TL;DR: This paper uses the TPC-H benchmark as an optimization case study, gathers logs from the application, the JVM, system utilization, and hardware events, and introduces several JVM and OS parameter optimization approaches for accelerating Spark performance.

Abstract: Besides being an in-memory-oriented computing framework, Spark runs on top of Java Virtual Machines (JVMs), so JVM parameters must be tuned to improve Spark application performance. Misconfigured parameters and settings degrade performance. For example, using Java heaps that are too large often causes a long garbage collection pause time, which accounts for over 10–20% of application execution time. Moreover, recent computing nodes have many cores with simultaneous multi-threading technology and the processors on the node are connected via NUMA, so it is difficult to achieve the best performance without taking these hardware features into account. Thus, optimization across the full stack is also important. Not only JVM parameters but also OS parameters, Spark configuration, and application code based on CPU characteristics need to be optimized to take full advantage of the underlying computing resources. In this paper, we used the TPC-H benchmark as our optimization case study and gathered logs from many perspectives, such as application, JVM (e.g. GC and JIT), system utilization, and hardware events from a performance monitoring unit. We discuss current problems and introduce several JVM and OS parameter optimization approaches for accelerating Spark performance. As a result, our optimization exhibits a 30–40% increase in speed on average and is up to 5x faster than the naive configuration.

Proceedings Article•DOI•

SnappyData: A Hybrid Transactional Analytical Store Built On Spark

Jags Ramnarayan, Barzan Mozafari¹, Sumedh Wale, Sudhir Menon, Neeraj Kumar, Hemant Bhanawat, Soubhik Chakraborty, Yogesh Mahajan, Rishitesh Mishra, Kishor Bachhav•Institutions (1)

University of Michigan¹

26 Jun 2016

TL;DR: This work proposes a unified engine for real-time operational analytics, delivering stream analytics, OLTP and OLAP in a single integrated solution through a seamless integration of Apache Spark (as a big data computational engine) with GemFire (as an in-memory transactional store with scale-out SQL semantics).

Abstract: In recent years, our customers have expressed frustration with the traditional approach of using a combination of disparate products to handle their streaming, transactional and analytical needs. The common practice of stitching heterogeneous environments in custom ways has caused enormous production woes by increasing development complexity and total cost of ownership. With SnappyData, an open source platform, we propose a unified engine for real-time operational analytics, delivering stream analytics, OLTP and OLAP in a single integrated solution. We realize this platform through a seamless integration of Apache Spark (as a big data computational engine) with GemFire (as an in-memory transactional store with scale-out SQL semantics). In this demonstration, after presenting a few use case scenarios, we exhibit SnappyData as our in-memory solution for delivering truly interactive analytics (i.e., a couple of seconds) when faced with large data volumes or high velocity streams. We show that SnappyData can exploit state-of-the-art approximate query processing techniques and a variety of data synopses. Finally, we allow the audience to define various high-level accuracy contracts (HAC) to communicate their accuracy requirements to SnappyData in an intuitive fashion.

Book Chapter•DOI•

SPARQLGX: Efficient Distributed Evaluation of SPARQL with Apache Spark

Damien Graux¹,², Louis Jachiet¹,², Pierre Genevès¹,², Nabil Layaïda¹,²•Institutions (2)

Centre national de la recherche scientifique¹, University of Grenoble²

17 Oct 2016

TL;DR: This work proposes SPARQLGX, an implementation of a distributed RDF datastore based on Apache Spark that is designed to leverage existing Hadoop infrastructures for evaluating SPARQL queries, and shows that this approach scales better than related state-of-the-art systems in terms of supported dataset size.

Abstract: SPARQL is the W3C standard query language for querying data expressed in the Resource Description Framework (RDF). The increasing amounts of RDF data available raise a major need and research interest in building efficient and scalable distributed SPARQL query evaluators. In this context, we propose SPARQLGX: our implementation of a distributed RDF datastore based on Apache Spark. SPARQLGX is designed to leverage existing Hadoop infrastructures for evaluating SPARQL queries. SPARQLGX relies on a translation of SPARQL queries into executable Spark code that adopts evaluation strategies according to (1) the storage method used and (2) statistics on data. We show that SPARQLGX makes it possible to evaluate SPARQL queries on billions of triples distributed across multiple nodes, while providing attractive performance figures. We report on experiments which show how SPARQLGX compares to related state-of-the-art implementations and we show that our approach scales better than these systems in terms of supported dataset size. With its simple design, SPARQLGX represents an interesting alternative in several scenarios.

Proceedings Article•DOI•

Spark-GPU: An accelerated in-memory data processing engine on clusters

Yuan Yuan¹, Meisam Fathi Salmi², Yin Huai, Kaibo Wang³, Rubao Lee¹, Xiaodong Zhang¹•Institutions (3)

Ohio State University¹, PayPal², Google³

01 Dec 2016

TL;DR: The design and implementation of Spark-GPU is presented that enables Spark to utilize GPU's massively parallel processing ability to achieve both high performance and high throughput and improves the performance of machine learning workloads and SQL queries.

Abstract: Apache Spark is an in-memory data processing system that supports both SQL queries and advanced analytics over large data sets. In this paper, we present our design and implementation of Spark-GPU that enables Spark to utilize GPU's massively parallel processing ability to achieve both high performance and high throughput. Spark-GPU transforms a general-purpose data processing system into a GPU-supported system by addressing several real-world technical challenges including minimizing internal and external data transfers, preparing a suitable data format and a batching mode for efficient GPU execution, and determining the suitability of workloads for GPU with a task scheduling capability between CPU and GPU. We have comprehensively evaluated Spark-GPU with a set of representative analytical workloads to show its effectiveness. Our results show that Spark-GPU improves the performance of machine learning workloads by up to 16.13x and the performance of SQL queries by up to 4.83x.

Book Chapter•DOI•

Online Anomaly Energy Consumption Detection Using Lambda Architecture

Xiufeng Liu¹, Nadeem Iftikhar², Per Sieverts Nielsen¹, Alfred Heller¹•Institutions (2)

Technical University of Denmark¹, University College of Northern Denmark²

05 Sep 2016

TL;DR: A supervised learning and statistical-based anomaly detection method, and a Lambda system using the in-memory distributed computing framework, Spark and its extension Spark Streaming are implemented.

Abstract: With the wide use of smart meters in the energy sector, anomaly detection becomes a crucial means to study the unusual consumption behaviors of customers and to promptly discover unexpected events in energy use. Detecting consumption anomalies is, essentially, a real-time big data analytics problem, which involves data mining on a large number of parallel data streams from smart meters. In this paper, we propose a supervised learning and statistical-based anomaly detection method, and implement a Lambda system using the in-memory distributed computing framework Spark and its extension Spark Streaming. The system supports not only iteratively refreshing the detection models from scalable data sets, but also real-time anomaly detection on scalable live data streams. This paper empirically evaluates the system and the detection algorithm, and the results show the effectiveness and the scalability of the lambda detection system.

Proceedings Article•DOI•

SPARK – A Big Data Processing Platform for Machine Learning

Jian Fu¹, Junwei Sun, Kaiyuan Wang•Institutions (1)

Wuhan University of Technology¹

01 Dec 2016

TL;DR: This paper analyzes Spark's primary framework and core technologies, runs a machine learning instance on it, and then analyzes the results and introduces the hardware equipment used.

Abstract: Apache Spark is a distributed memory-based computing framework which is naturally suitable for machine learning. Compared to Hadoop, Spark has better computing ability. In this paper, we analyze Spark's primary framework and core technologies, and run a machine learning instance on it. Finally, we analyze the results and introduce our hardware equipment.

Proceedings Article•

When apache spark meets FPGAs: a case study for next-generation DNA sequencing acceleration

Yu-Ting Chen¹, Jason Cong¹, Zhenman Fang¹, Jie Lei¹, Peng Wei¹•Institutions (1)

University of California, Los Angeles¹

20 Jun 2016

TL;DR: This paper conducts an in-depth analysis of challenges at single-thread, single-node multi-thread, and multi-node levels, and proposes solutions including batch processing and the FPGA-as-a-Service framework to address them.

Abstract: FPGA-enabled datacenters have shown great potential for providing performance and energy efficiency improvement. In this paper we aim to answer one key question: how can we efficiently integrate FPGAs into state-of-the-art big-data computing frameworks like Apache Spark? To provide a generalized methodology and insights for efficient integration, we conduct an in-depth analysis of challenges at single-thread, single-node multi-thread, and multi-node levels, and propose solutions including batch processing and the FPGA-as-a-Service framework to address them. With a step-by-step case study for the next-generation DNA sequencing application, we demonstrate how a straightforward integration with 1,000x slowdown can be tuned into an efficient integration with 2.6x overall system speedup and 2.4x energy efficiency improvement.

Proceedings Article•DOI•

Big data management processing with Hadoop MapReduce and spark technology: A comparison

Ankush Verma¹, Ashik Hussain Mansuri¹, Neelesh Jain•Institutions (1)

Pacific University¹

18 Mar 2016

TL;DR: This paper elaborates on the working of Hadoop MapReduce and the Spark architecture, together with the kinds of operations each supports, and shows the differences between Hadoop MapReduce and Spark through the Map and Reduce phases individually.

Abstract: Hadoop MapReduce is used to analyze large volumes of data across multiple nodes in parallel. MapReduce has two functions, Map and Reduce, and large data is stored in HDFS. Because MapReduce lacks some facilities, Spark was designed to handle real-time stream data and fast queries. Spark jobs operate on Resilient Distributed Datasets and a directed acyclic graph execution engine. In this paper, we elaborate on the working of Hadoop MapReduce and the Spark architecture, together with the kinds of operations each supports. We also show the differences between Hadoop MapReduce and Spark through the Map and Reduce phases individually.
