Journal ArticleDOI
Apache Spark: a unified engine for big data processing
Matei Zaharia, Reynold Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph E. Gonzalez, Scott Shenker, Ion Stoica
Abstract: This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.
Citations
Journal ArticleDOI
Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark
Athanasios Alexopoulos, Georgios Drakopoulos, Andreas Kanavos, Phivos Mylonas, Gerasimos Vonitsanos
TL;DR: This article relies in two ways on classifiers implemented in MLlib, the main machine learning library for Apache Spark, using SVD preprocessing to reduce complexity while maintaining similar, if not better, accuracy, recall, and F1 scores.
Proceedings ArticleDOI
A Flexible Security Analytics Service for the Industrial IoT
Philip Empl, Günther Pernul
TL;DR: In this paper, the authors conceptualize a flexible security analytics service that implements security capabilities with analytical techniques fitting specific SMEs' needs, and evaluate it with a real-world use case.
Journal ArticleDOI
A Survey on Big Data Processing Frameworks for Mobility Analytics
TL;DR: In this article, a survey of big data processing frameworks for mobility analytics is presented, focusing on the underlying techniques (indexing, partitioning, and query processing) that are essential for efficient and scalable data management.
Posted Content
ML-AQP: Query-Driven Approximate Query Processing based on Machine Learning.
TL;DR: This work offers an ML-based solution that provides approximate answers to aggregate queries; it can work alongside Cloud systems, with low response times, low monetary and computational costs, and a small energy footprint.
Journal ArticleDOI
Coffea Columnar Object Framework For Effective Analysis
Nicholas Smith, Lindsey Gray, Matteo Cremonesi, B. Jayatilaka, Oliver Gutsche, Allison Reinsvold Hall, Kevin Pedro, Maria Acosta, Andrew Melo, S. Belforte, Jim Pivarski
TL;DR: This work discusses the experience of implementing CMS data analysis using the coffea framework, along with the user experience and future directions.
References
Journal ArticleDOI
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets, which runs on large clusters of commodity machines and is highly scalable.
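As a toy illustration of the programming model (a single-machine sketch with hypothetical helper names; the real system distributes each phase across a cluster), word count in plain Python can mimic the map, shuffle, and reduce phases:

```python
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs, as a user-supplied map() function would."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each key's values, as a user-supplied reduce() function would."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data", "big clusters"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"big": 2, "data": 1, "clusters": 1}
```

The same shape (stateless map, keyed grouping, per-key reduce) is what makes the model easy to parallelize and to re-execute on failure.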
Proceedings Article
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
TL;DR: Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform fault-tolerant in-memory computations on large clusters, are presented and implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.
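The fault-tolerance idea behind RDDs (recompute lost data from recorded lineage rather than replicate it) can be sketched with a toy class; the names here are illustrative, and real RDDs partition data across a cluster and replay lineage only for lost partitions:

```python
class ToyRDD:
    """Toy single-machine sketch of lazy, lineage-based recomputation."""

    def __init__(self, source, lineage=()):
        self.source = source      # original input data
        self.lineage = lineage    # recorded transformations, applied lazily

    def map(self, fn):
        # Transformations don't compute anything; they extend the lineage.
        return ToyRDD(self.source, self.lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self.source, self.lineage + (("filter", pred),))

    def collect(self):
        """An action: recompute from the source by replaying the lineage."""
        data = list(self.source)
        for op, fn in self.lineage:
            if op == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

rdd = ToyRDD(range(5)).map(lambda x: x * 2).filter(lambda x: x > 4)
result = rdd.collect()  # [6, 8]
```

Because the lineage fully determines the result from the source, any lost intermediate state can be rebuilt deterministically, which is cheaper than checkpointing or replicating in-memory data.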
Journal ArticleDOI
A bridging model for parallel computation
TL;DR: The bulk-synchronous parallel (BSP) model is introduced as a candidate bridging model between software and hardware, with results quantifying its efficiency both in implementing high-level language features and algorithms and in being implemented in hardware.
Proceedings ArticleDOI
Pregel: a system for large-scale graph processing
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski
TL;DR: Pregel is a model for processing large graphs, designed for efficient, scalable, and fault-tolerant implementation on clusters of thousands of commodity computers; its implied synchronicity makes reasoning about programs easier.
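The vertex-centric superstep model can be sketched with a toy, single-machine max-value propagation (illustrative function and variable names, not Pregel's actual API): in each superstep, every vertex reads the messages sent to it in the previous superstep, updates its value, and sends messages along its out-edges; the computation halts when no vertex changes.

```python
def pregel_max(edges, values, max_supersteps=10):
    """Propagate the maximum vertex value through a directed graph."""
    # Superstep 0: every vertex sends its value along its out-edges.
    messages = {v: [] for v in values}
    for v in values:
        for dst in edges.get(v, []):
            messages[dst].append(values[v])

    for _ in range(max_supersteps):
        new_messages = {v: [] for v in values}
        active = False
        for v in values:
            # Each vertex sees only messages from the previous superstep.
            if messages[v] and max(messages[v]) > values[v]:
                values[v] = max(messages[v])
                for dst in edges.get(v, []):
                    new_messages[dst].append(values[v])
                active = True
        if not active:
            break  # every vertex has voted to halt
        messages = new_messages
    return values

edges = {1: [2], 2: [3], 3: [1]}
values = {1: 3, 2: 6, 3: 1}
final = pregel_max(edges, values)  # every vertex converges to 6
```

The barrier between supersteps (here, swapping in `new_messages` only after all vertices run) is the "implied synchronicity" the TL;DR mentions: no vertex can observe a message from its own superstep, so programs behave deterministically.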
Proceedings ArticleDOI
Dryad: distributed data-parallel programs from sequential building blocks
TL;DR: The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.