Author

Davies Liu

Bio: Davies Liu is an academic researcher who has contributed to research in the topics Spark (mathematics) and Programming with Big Data in R. The author has an h-index of 4 and has co-authored 4 publications receiving 2608 citations.

Papers

Journal Article

MLlib: Machine Learning in Apache Spark

Xiangrui Meng, Joseph K. Bradley, Burak Yavuz, Evan R. Sparks¹, Shivaram Venkataraman¹, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen², Doris Xin³, Reynold Xin, Michael J. Franklin¹, Reza Bosagh Zadeh⁴, Matei Zaharia⁵, Ameet Talwalkar⁶

University of California, Berkeley¹, Cloudera², University of Illinois at Urbana-Champaign³, Stanford University⁴, Massachusetts Institute of Technology⁵, University of California, Los Angeles⁶

01 Jan 2016-Journal of Machine Learning Research

TL;DR: MLlib is an open-source distributed machine learning library for Apache Spark that provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.

Abstract: Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.

1,551 citations

Proceedings Article

Spark SQL: Relational Data Processing in Spark

Michael Armbrust, Reynold Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan¹, Michael J. Franklin¹, Ali Ghodsi, Matei Zaharia

University of California, Berkeley¹

27 May 2015

TL;DR: Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API, and includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language.

Abstract: Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.

1,230 citations

Posted Content

MLlib: Machine Learning in Apache Spark

University of California, Berkeley¹, Cloudera², University of Illinois at Urbana-Champaign³, Stanford University⁴, Massachusetts Institute of Technology⁵, University of California, Los Angeles⁶

26 May 2015-arXiv: Learning

TL;DR: MLlib is an open-source distributed machine learning library for Apache Spark that provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.

Abstract: Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced a rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.

84 citations

Proceedings Article

SparkR: Scaling R Programs with Spark

Shivaram Venkataraman¹, Zongheng Yang¹, Davies Liu, Eric Liang, Hossein Falaki, Xiangrui Meng, Reynold Xin, Ali Ghodsi, Michael J. Franklin¹, Ion Stoica¹, Matei Zaharia²

University of California, Berkeley¹, Massachusetts Institute of Technology²

14 Jun 2016

TL;DR: This paper presents SparkR, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large-scale data analysis from the R shell.

Abstract: R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the R runtime is single threaded and can only process data sets that fit in a single machine's memory. We present SparkR, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large scale data analysis from the R shell. We describe the main design goals of SparkR, discuss how the high-level DataFrame API enables scalable computation and present some of the key details of our implementation.

65 citations

Cited by

Proceedings Article

XGBoost: A Scalable Tree Boosting System

Tianqi Chen¹, Carlos Guestrin¹

University of Washington¹

13 Aug 2016

TL;DR: XGBoost is a scalable end-to-end tree boosting system, built around a novel sparsity-aware algorithm for sparse data and a weighted quantile sketch for approximate tree learning, that data scientists use widely to achieve state-of-the-art results on many machine learning challenges.

Abstract: Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

14,872 citations

Proceedings Article

XGBoost: A Scalable Tree Boosting System

Tianqi Chen¹, Carlos Guestrin¹

University of Washington¹

09 Mar 2016-arXiv: Learning

TL;DR: This paper proposes a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning and provides insights on cache access patterns, data compression and sharding to build a scalable tree boosting system called XGBoost.

13,333 citations

Journal Article

Apache Spark: A Unified Engine for Big Data Processing

Matei Zaharia¹, Reynold Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave², Xiangrui Meng, Josh Rosen, Shivaram Venkataraman², Michael J. Franklin², Ali Ghodsi², Joseph E. Gonzalez², Scott Shenker², Ion Stoica²

Stanford University¹, University of California, Berkeley²

28 Oct 2016-Communications of The ACM

TL;DR: This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.

Abstract: This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.

1,776 citations

Journal Article

MLlib: Machine Learning in Apache Spark

University of California, Berkeley¹, Cloudera², University of Illinois at Urbana-Champaign³, Stanford University⁴, Massachusetts Institute of Technology⁵, University of California, Los Angeles⁶

01 Jan 2016-Journal of Machine Learning Research

1,551 citations

Journal Article

Opportunities and obstacles for deep learning in biology and medicine

Travers Ching¹, Daniel Himmelstein², Brett K. Beaulieu-Jones², Alexandr A. Kalinin³, Brian T. Do⁴, Gregory P. Way², Enrico Ferrero⁵, Paul-Michael Agapow⁶, Michael Zietz², Michael M. Hoffman⁷,⁸, Wei Xie⁹, Gail L. Rosen¹⁰, Benjamin J. Lengerich¹¹, Johnny Israeli¹², Jack Lanchantin¹³, Stephen Woloszynek¹⁰, Anne E. Carpenter¹⁴, Avanti Shrikumar¹², Jinbo Xu¹⁵, Evan M. Cofer¹⁶,¹⁷, Christopher A. Lavender¹⁸, Srinivas C. Turaga¹⁹, Amr Alexandari¹², Zhiyong Lu¹⁸, David J. Harris²⁰, Dave DeCaprio, Yanjun Qi¹³, Anshul Kundaje¹², Yifan Peng¹⁸, Laura K. Wiley²¹, Marwin H. S. Segler²², Simina M. Boca²³, S. Joshua Swamidass²⁴, Austin Huang²⁵, Anthony Gitter²⁶,²⁷, Casey S. Greene²

University of Hawaii at Manoa¹, University of Pennsylvania², University of Michigan³, Harvard University⁴, GlaxoSmithKline⁵, Imperial College London⁶, University of Toronto⁷, Princess Margaret Cancer Centre⁸, Vanderbilt University⁹, Drexel University¹⁰, Carnegie Mellon University¹¹, Stanford University¹², University of Virginia¹³, Broad Institute¹⁴, Toyota Technological Institute at Chicago¹⁵, Trinity University¹⁶, Princeton University¹⁷, National Institutes of Health¹⁸, Howard Hughes Medical Institute¹⁹, University of Florida²⁰, University of Colorado Denver²¹, University of Münster²², Georgetown University Medical Center²³, Washington University in St. Louis²⁴, Brown University²⁵, University of Wisconsin-Madison²⁶, Morgridge Institute for Research²⁷

01 Apr 2018-Journal of the Royal Society Interface

TL;DR: It is found that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art.

Abstract: Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes and treatment of patients) and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.

1,491 citations
