MLlib: Machine Learning in Apache Spark

Open Access Journal Article

Xiangrui Meng, +15 more

- 01 Jan 2016 -

Journal of Machine Learning Research

- Vol. 17, Iss: 1, pp 1235-1241

TL;DR

MLlib is an open-source distributed machine learning library for Apache Spark that provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.

Abstract:

Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.

Citations

Proceedings Article

dislib: Large Scale High Performance Machine Learning in Python

Javier Álvarez Cid-Fuentes, +4 more

TL;DR: This paper presents and evaluates dislib, a distributed machine learning library built on top of the PyCOMPSs programming model that addresses the issues of other existing libraries. The evaluation shows that dislib can be up to 9 times faster, and can process data sets up to 16 times larger, than other popular distributed machine learning libraries such as MLlib.

Journal Article

CrossRec: Cross-Domain Recommendations Based on Social Big Data and Cognitive Computing

Yin Zhang, +5 more

- 01 Dec 2018 -

Mobile Networks and Applications

TL;DR: This work proposes a cross-domain recommender system, including three approaches, based on multi-source social big data, and shows that the accuracies of the three proposed approaches are significantly improved compared with the conventional recommender approaches, such as collaborative filtering and matrix factorization.

Proceedings Article

One-Pass Logistic Regression for Label-Drift and Large-Scale Classification on Distributed Systems

Vu Nguyen, +4 more

TL;DR: A novel variant of logistic regression (LR), namely one-pass logistic regression (OLR), is introduced to offer a principled treatment for label-drift and large-scale classification, and is extended to a distributed setting for parallelization, termed sparkling OLR (Spark-OLR).

Book Chapter

Cognitive Computing: Where Big Data Is Driving Us.

Ana Paula Appel, +2 more

TL;DR: The concepts and challenges of designing Cognitive Systems are discussed, addressing the questions for Cognitive Systems: What are the needs?

References

Journal Article

Latent Dirichlet Allocation

David M. Blei, +2 more

- 01 Mar 2003 -

Journal of Machine Learning Research

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.

Proceedings Article

Latent Dirichlet Allocation

David M. Blei, +2 more

TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).

Journal Article

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean, +1 more

TL;DR: This paper presents MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.

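The MapReduce programming model summarized in the entry above can be illustrated with a small word-count sketch. This is a single-process illustration only, not the distributed implementation the paper describes: the names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical, and the in-memory grouping step stands in for the shuffle that a real cluster would perform across machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(document: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> int:
    """Reduce step: combine all values emitted for one key."""
    return sum(counts)


def run_mapreduce(
    inputs: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Hypothetical single-process driver: map, group by key, then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):
            # Stand-in for the shuffle phase: collect values by key.
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


documents = ["spark runs on clusters", "mapreduce runs on large clusters"]
word_counts = run_mapreduce(documents, map_words, reduce_counts)
```

Because the mapper and reducer are plain functions with no shared state, the same pair could in principle be run over partitions of the input in parallel, which is the property that lets the model scale out.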