Spark SQL: Relational Data Processing in Spark

doi:10.1145/2723372.2742797

Proceedings Article

- pp 1383-1394

TLDR

Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API, and includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language.

Abstract:

Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
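The "composable rules" idea behind an optimizer such as Catalyst can be illustrated with a minimal sketch in plain Python. This is not Catalyst's API: Catalyst is written in Scala and operates on its own tree types. The names below (`Literal`, `Attribute`, `Add`, `fold_constants`, `drop_zero`, `transform_up`, `optimize`) are hypothetical stand-ins chosen for illustration. The sketch shows small, independent rewrite rules applied bottom-up to an expression tree until the tree stops changing.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Union


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]
# A rule returns a rewritten node, or None when it does not apply.
Rule = Callable[[Expr], Optional[Expr]]


def fold_constants(e: Expr) -> Optional[Expr]:
    """Rewrite Add(Literal(a), Literal(b)) to Literal(a + b)."""
    if isinstance(e, Add) and isinstance(e.left, Literal) and isinstance(e.right, Literal):
        return Literal(e.left.value + e.right.value)
    return None


def drop_zero(e: Expr) -> Optional[Expr]:
    """Rewrite x + 0 and 0 + x to x."""
    if isinstance(e, Add):
        if e.left == Literal(0):
            return e.right
        if e.right == Literal(0):
            return e.left
    return None


def transform_up(e: Expr, rules: Sequence[Rule]) -> Expr:
    """Rewrite children first, then try each rule on the current node."""
    if isinstance(e, Add):
        e = Add(transform_up(e.left, rules), transform_up(e.right, rules))
    for rule in rules:
        rewritten = rule(e)
        if rewritten is not None:
            e = rewritten
    return e


def optimize(e: Expr, rules: Sequence[Rule], max_iters: int = 100) -> Expr:
    """Apply the rule batch repeatedly until a fixed point (or max_iters)."""
    for _ in range(max_iters):
        new = transform_up(e, rules)
        if new == e:
            return e
        e = new
    return e


RULES = [fold_constants, drop_zero]

# x + (1 + 2) is rewritten to x + 3.
plan = Add(Attribute("x"), Add(Literal(1), Literal(2)))
optimized = optimize(plan, RULES)
```

Each rule only needs to recognize the pattern it cares about. Adding a new optimization means appending another function to `RULES`, without touching the traversal or the other rules.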

Citations

Journal Article

MLlib: machine learning in Apache Spark

Xiangrui Meng, +15 more

- 01 Jan 2016 -

Journal of Machine Learning Research

TL;DR: MLlib is an open-source distributed machine learning library for Apache Spark that provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.

Journal Article

Big data preprocessing: methods and prospects

Salvador García, +4 more

TL;DR: The definition, characteristics, and categorization of data preprocessing approaches in big data are introduced and research challenges are discussed, with a focus on developments in different big data frameworks, such as Hadoop, Spark, and Flink.

Proceedings Article

Opaque: an oblivious and encrypted distributed analytics platform

Wenting Zheng, +5 more

TL;DR: Opaque introduces new distributed oblivious relational operators that hide access patterns, along with new query planning techniques that optimize these operators to improve performance.

Journal Article

Big data analytics on Apache Spark

Salman Salloum, +4 more

- 13 Oct 2016 -

International Journal of Data Science and Analytics

TL;DR: This review surveys what Apache Spark offers for designing and implementing big data algorithms and pipelines for machine learning, graph analysis, and stream processing, and highlights some research and development directions on Apache Spark for big data analytics.

References

Proceedings Article

Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

Matei Zaharia, +8 more

TL;DR: Resilient Distributed Datasets (RDDs) are presented: a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.

Journal Article

["R"--project for statistical computing].

Ram Benny Dessau, +1 more

- 28 Jan 2008 -

Ugeskrift for Læger

TL;DR: An introduction to the R project for statistical computing (www.R-project.org) is presented to make the professional community aware of "R" as powerful, free software for graphical and statistical analysis of medical data.

Proceedings Article

Pig Latin: a not-so-foreign language for data processing

Christopher Olston, +4 more

TL;DR: A new language called Pig Latin is described, designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of map-reduce. The accompanying system, Pig, is an open-source Apache-incubator project available for general use.

Related Papers (5)

Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

Matei Zaharia, +8 more

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

- 01 Jan 2008 -

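Communications of the ACM

Spark: cluster computing with working sets

Hive: a warehousing solution over a map-reduce framework

The Hadoop Distributed File System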
