Proceedings Article

Spark SQL: Relational Data Processing in Spark

TLDR
Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API, and includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language.
Abstract
Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
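
The abstract's point about "composable rules" can be illustrated with a toy sketch. The code below is plain Python, not Catalyst's actual Scala API. Every name in it (`Literal`, `Attribute`, `Add`, `transform_up`, `fold_constants`, `drop_zero`, `optimize`) is hypothetical and chosen only for illustration. It assumes a rule is a function from an expression tree node to a possibly rewritten node, and that an optimizer applies a list of such rules repeatedly until the tree stops changing.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


# Hypothetical expression tree nodes (illustrative only, not Catalyst classes).
@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]
Rule = Callable[[Expr], Expr]


def transform_up(expr: Expr, rule: Rule) -> Expr:
    """Rewrite the children first, then apply the rule to the rebuilt node."""
    if isinstance(expr, Add):
        expr = Add(transform_up(expr.left, rule), transform_up(expr.right, rule))
    return rule(expr)


def fold_constants(expr: Expr) -> Expr:
    """Rule: replace Add(Literal, Literal) with a single Literal."""
    if (
        isinstance(expr, Add)
        and isinstance(expr.left, Literal)
        and isinstance(expr.right, Literal)
    ):
        return Literal(expr.left.value + expr.right.value)
    return expr


def drop_zero(expr: Expr) -> Expr:
    """Rule: simplify x + 0 and 0 + x to x."""
    if isinstance(expr, Add):
        if expr.left == Literal(0):
            return expr.right
        if expr.right == Literal(0):
            return expr.left
    return expr


def optimize(expr: Expr, rules: List[Rule]) -> Expr:
    """Apply every rule over the whole tree until a fixed point is reached."""
    while True:
        rewritten = expr
        for rule in rules:
            rewritten = transform_up(rewritten, rule)
        if rewritten == expr:
            return expr
        expr = rewritten


# x + (1 + -1): folding yields x + 0, which drop_zero then reduces to x.
plan = Add(Attribute("x"), Add(Literal(1), Literal(-1)))
optimized = optimize(plan, [fold_constants, drop_zero])
```

Each rule here is small and independent, and neither knows about the other. Composition comes from running them together to a fixed point, so one rule's output can expose a pattern that another rule matches.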


Trending Questions
How do Spark SQL and DataFrames differ from RDDs in terms of data processing and analysis techniques?

Spark SQL and the DataFrame API offer relational processing with declarative queries and optimized storage. Plain RDDs lack these features, which makes them less efficient for complex data analysis tasks.
