Journal ArticleDOI
Apache Spark: a unified engine for big data processing
Matei Zaharia, Reynold Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph E. Gonzalez, Scott Shenker, Ion Stoica
Abstract: This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.
Citations
Journal ArticleDOI
Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark
Athanasios Alexopoulos, Georgios Drakopoulos, Andreas Kanavos, Phivos Mylonas, Gerasimos Vonitsanos
TL;DR: This article relies in two ways on classifiers implemented in MLlib, the main machine learning library for Apache Spark, using SVD preprocessing to reduce complexity while maintaining similar, if not better, accuracy, recall, and F1 scores.
Proceedings ArticleDOI
A Flexible Security Analytics Service for the Industrial IoT
Philip Empl, Günther Pernul
TL;DR: In this paper, the authors conceptualize a flexible security analytics service that implements security capabilities with analytical techniques fitting specific SMEs' needs, and evaluate it with a real-world use case.
Journal ArticleDOI
A Survey on Big Data Processing Frameworks for Mobility Analytics
TL;DR: In this article, a survey of big data processing frameworks for mobility analytics is presented, focusing on the underlying techniques (indexing, partitioning, and query processing) that are essential for efficient and scalable data management.
Posted Content
ML-AQP: Query-Driven Approximate Query Processing based on Machine Learning.
TL;DR: This work offers an ML-based solution that provides approximate answers to aggregate queries; it can work alongside Cloud systems, with low response times, low monetary and computational costs, and a small energy footprint.
Journal ArticleDOI
Coffea Columnar Object Framework For Effective Analysis
Nicholas Smith, Lindsey Gray, Matteo Cremonesi, B. Jayatilaka, Oliver Gutsche, Allison Reinsvold Hall, Kevin Pedro, Maria Acosta, Andrew Melo, S. Belforte, Jim Pivarski
TL;DR: This work discusses the experience of implementing CMS data analysis using the coffea framework, along with the user experience and future directions.
References
Journal ArticleDOI
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets, which runs on large clusters of commodity machines and is highly scalable.
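As a toy illustration of the programming model (a single-machine sketch with hypothetical helper names; the real system distributes each phase across a cluster), word count in plain Python can mimic the map, shuffle, and reduce phases:

```python
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs, as a user-supplied map() function would."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each key's values, as a user-supplied reduce() function would."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data", "big clusters"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"big": 2, "data": 1, "clusters": 1}
```

The same shape (stateless map, keyed grouping, per-key reduce) is what makes the model easy to parallelize and to re-execute on failure.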
Proceedings Article
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
TL;DR: Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform fault-tolerant in-memory computations on large clusters, are presented and implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.
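The fault-tolerance idea behind RDDs (recompute lost data from recorded lineage rather than replicate it) can be sketched with a toy class; the names here are illustrative, and real RDDs partition data across a cluster and replay lineage only for lost partitions:

```python
class ToyRDD:
    """Toy single-machine sketch of lazy, lineage-based recomputation."""

    def __init__(self, source, lineage=()):
        self.source = source      # original input data
        self.lineage = lineage    # recorded transformations, applied lazily

    def map(self, fn):
        # Transformations don't compute anything; they extend the lineage.
        return ToyRDD(self.source, self.lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self.source, self.lineage + (("filter", pred),))

    def collect(self):
        """An action: recompute from the source by replaying the lineage."""
        data = list(self.source)
        for op, fn in self.lineage:
            if op == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

rdd = ToyRDD(range(5)).map(lambda x: x * 2).filter(lambda x: x > 4)
result = rdd.collect()  # [6, 8]
```

Because the lineage fully determines the result from the source, any lost intermediate state can be rebuilt deterministically, which is cheaper than checkpointing or replicating in-memory data.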
Journal ArticleDOI
A bridging model for parallel computation
TL;DR: The bulk-synchronous parallel (BSP) model is introduced as a candidate bridging model between software and hardware, with results quantifying its efficiency both in implementing high-level language features and algorithms and in being implemented in hardware.
Proceedings ArticleDOI
Pregel: a system for large-scale graph processing
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski
TL;DR: Pregel is a model for processing large graphs, designed for efficient, scalable, and fault-tolerant implementation on clusters of thousands of commodity computers; its implied synchronicity makes reasoning about programs easier.
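The vertex-centric superstep model can be sketched with a toy, single-machine max-value propagation (illustrative function and variable names, not Pregel's actual API): in each superstep, every vertex reads the messages sent to it in the previous superstep, updates its value, and sends messages along its out-edges; the computation halts when no vertex changes.

```python
def pregel_max(edges, values, max_supersteps=10):
    """Propagate the maximum vertex value through a directed graph."""
    # Superstep 0: every vertex sends its value along its out-edges.
    messages = {v: [] for v in values}
    for v in values:
        for dst in edges.get(v, []):
            messages[dst].append(values[v])

    for _ in range(max_supersteps):
        new_messages = {v: [] for v in values}
        active = False
        for v in values:
            # Each vertex sees only messages from the previous superstep.
            if messages[v] and max(messages[v]) > values[v]:
                values[v] = max(messages[v])
                for dst in edges.get(v, []):
                    new_messages[dst].append(values[v])
                active = True
        if not active:
            break  # every vertex has voted to halt
        messages = new_messages
    return values

edges = {1: [2], 2: [3], 3: [1]}
values = {1: 3, 2: 6, 3: 1}
final = pregel_max(edges, values)  # every vertex converges to 6
```

The barrier between supersteps (here, swapping in `new_messages` only after all vertices run) is the "implied synchronicity" the TL;DR mentions: no vertex can observe a message from its own superstep, so programs behave deterministically.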
Proceedings ArticleDOI
Dryad: distributed data-parallel programs from sequential building blocks
TL;DR: The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.