Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

/pdf/xgboost-a-scalable-tree-boosting-system-48ocu6x4c7.pdf

XGBoost: A Scalable Tree Boosting System

This open-source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.

https://dl.acm.org/doi/pdf/10.1145/2934664

Apache Spark: A Unified Engine for Big Data Processing

Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.

/pdf/mllib-machine-learning-in-apache-spark-27ddi10m3m.pdf

MLlib: Machine Learning in Apache Spark

Graphs are an important data representation that appears in a wide diversity of real-world scenarios. Effective graph analytics gives users a deeper understanding of what is behind the data, and thus can benefit many useful applications such as node classification, node recommendation, and link prediction. However, most graph analytics methods suffer from high computation and space costs. Graph embedding is an effective and efficient way to solve the graph analytics problem. It converts the graph data into a low-dimensional space in which the graph structural information and graph properties are maximally preserved. In this survey, we conduct a comprehensive review of the literature on graph embedding. We first introduce the formal definition of graph embedding as well as the related concepts. After that, we propose two taxonomies of graph embedding, which correspond to what challenges exist in different graph embedding problem settings and how the existing work addresses these challenges in its solutions. Finally, we summarize the applications that graph embedding enables and suggest four promising future research directions in terms of computation efficiency, problem settings, techniques, and application scenarios.

A Comprehensive Survey of Graph Embedding: Problems, Techniques, and Applications

Written mostly in Scala and with over 1000 contributors, Apache Spark has become the de facto standard for big data processing. In this talk, I will review the evolution of Spark for the last seven years and our experience using Scala as the main programming language in a high profile open source project with a distributed team. I will outline language features that we can't live without, and features we wish were designed differently. Last but not least, I will discuss how we at Databricks are leveraging native code to further improve performance.

Spark and Scala (keynote)

Author(s): Xin, Reynold Shi | Advisor(s): Franklin, Michael; Stoica, Ion | Abstract: Modern data analysis is undergoing a "Big Data" transformation: organizations are generating and gathering more data than ever before, in a variety of formats covering both structured and unstructured data, and employing increasingly sophisticated techniques such as machine learning and graph computation beyond the traditional roll-up and drill-down capabilities provided by SQL. To cope with the big data challenges, we believe that data processing systems will need to provide fine-grained fault recovery across a larger cluster of machines, support both SQL and complex analytics efficiently, and enable real-time computation. This dissertation builds on Apache Spark, a distributed dataflow engine, and creates three related systems: Spark SQL, Structured Streaming, and GraphX. Spark SQL combines relational and procedural processing through a new API called DataFrame. It also includes an extensible query optimizer to support a wide variety of data sources and analytic workloads. Structured Streaming extends Spark SQL's DataFrame API and query optimizer to automatically incrementalize queries, so users can reason about real-time stream data as batch datasets, and have the same application operate over both stream data and batch data. GraphX recasts graph specific system optimizations as dataflow optimizations, and provides an efficient framework for graph computation on top of Spark. The three systems have enjoyed wide adoption in industry and academia, and together they laid the foundation for Spark's 2.0 release. They demonstrate the feasibility and advantages of unifying disparate, specialized data systems on top of distributed dataflow systems.
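The idea of incrementalizing a query, as the abstract describes for Structured Streaming, can be illustrated with a toy example. The sketch below is plain Python, not Spark's API: `batch_count`, `IncrementalCount`, and the event data are hypothetical names invented for illustration. It only shows the property the abstract relies on, namely that folding micro-batches into running state gives the same answer as running the batch query over all the data seen so far.

```python
from collections import Counter
from typing import Iterable


def batch_count(events: Iterable[str]) -> Counter:
    """Batch query: count each key over a complete, finite dataset."""
    return Counter(events)


class IncrementalCount:
    """Incremental version of the same query.

    Hypothetical helper, not a Spark class: it keeps the running result as
    state and folds each newly arrived micro-batch into it.
    """

    def __init__(self) -> None:
        self.state: Counter = Counter()

    def update(self, micro_batch: Iterable[str]) -> Counter:
        # Only the new records are processed; earlier data is never rescanned.
        self.state.update(micro_batch)
        return self.state


# Made-up event stream, split into micro-batches as it "arrives".
micro_batches = [["click", "view"], ["view", "view"], ["click"]]

running = IncrementalCount()
for mb in micro_batches:
    running.update(mb)

# The incremental result after the last micro-batch equals the batch result
# computed over everything that has arrived.
all_events = [event for mb in micro_batches for event in mb]
assert running.state == batch_count(all_events)
```

A count is easy to incrementalize because it can be updated from the new records alone; the abstract's claim is that the query optimizer derives this kind of incremental plan automatically, which this sketch does not attempt.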

/pdf/go-with-the-flow-graphs-streaming-and-relational-1dhzvpmpr6.pdf

Go with the Flow: Graphs, Streaming and Relational Computations over Distributed Dataflow

Traditional enterprise warehouse solutions center around an analytical database system that is monolithic and inflexible: data needs to be extracted, transformed, and loaded into the rigid relational form before analysis. It takes years of sophisticated planning to provision and deploy a warehouse; adding new hardware resources to an existing warehouse is an equally lengthy and daunting task. Additionally, modern data analysis employs statistical methods that go well beyond the typical roll-up and drill-down capabilities provided by warehouse systems. Although it is possible to implement such methods using a combination of SQL and UDFs, query engines in relational databases are ill-suited for these. The Hadoop ecosystem introduces a suite of tools for data analytics that overcome some of the problems of traditional solutions. These systems, however, forgo years of warehouse research. Memory is significantly underutilized in Hadoop clusters, and the execution engine is naive compared with its relational counterparts. It is time to rethink the design of data warehouse systems and take the best from both worlds. The new generation of warehouse systems should be modular, high performance, fault-tolerant, easy to provision, and designed to support both SQL query processing and machine learning applications. This paper references the Shark system developed at Berkeley as an initial attempt.

/pdf/the-end-of-an-architectural-era-for-analytical-databases-1w6w29w9wn.pdf

The End of an Architectural Era for Analytical Databases

Many data management problems are inherently vague and hard for algorithms to process. Take for example entity resolution, also known as record linkage: the process of resolving records for the same entity from heterogeneous sources. Properly resolving such records requires not only the syntactic structure of the data, but also contextual semantics that are hard for machines to understand. Properly performing such data management tasks requires human input for providing information that is missing from the structured data that machines can read, for performing computationally difficult functions, and for matching, ranking, or aggregating results based on fuzzy criteria.
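One way to picture the division of labor described above is a triage step: a machine scores record pairs by syntactic similarity, decides the clear cases itself, and routes the ambiguous ones to people. The sketch below is an illustration under stated assumptions, not the method of the paper. The thresholds, the `triage` function, and the sample records are all hypothetical; the similarity measure is Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds, chosen only for illustration.
AUTO_MATCH = 0.9   # at or above this score, accept the pair automatically
AUTO_REJECT = 0.3  # at or below this score, reject the pair automatically


def similarity(a: str, b: str) -> float:
    """Purely syntactic similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage(a: str, b: str) -> str:
    """Decide a record pair by machine, or defer it to a human worker."""
    score = similarity(a, b)
    if score >= AUTO_MATCH:
        return "match"
    if score <= AUTO_REJECT:
        return "non_match"
    # Syntax alone is inconclusive; a person can supply the missing context.
    return "ask_human"


# Made-up product records from two hypothetical sources.
pairs = [
    ("Apple iPad 2 16GB WiFi White", "Apple iPad 2 16GB WiFi White"),
    ("Apple iPad 2 16GB WiFi White", "iPad Two, white, 16 GB, wireless"),
    ("Apple iPad 2 16GB WiFi White", "Sony Bravia 40 inch TV"),
]
for left, right in pairs:
    print(triage(left, right), "|", left, "|", right)
```

The design choice here is that human effort is spent only on the pairs between the two thresholds, where string similarity cannot tell whether two differently written records describe the same entity.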

/pdf/improving-data-management-applications-using-microtask-36dszu5x4s.pdf

Reynold Xin

Papers

Spark and Scala (keynote)

Go with the Flow: Graphs, Streaming and Relational Computations over Distributed Dataflow

The End of an Architectural Era for Analytical Databases

Improving Data Management Applications Using Microtask Platforms