MLlib: Machine Learning in Apache Spark

Open Access Journal Article

Xiangrui Meng, +15 more

- 01 Jan 2016 -

Journal of Machine Learning Research

- Vol. 17, Iss: 1, pp 1235-1241

TLDR

MLlib is an open-source distributed machine learning library for Apache Spark that provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.

Abstract:

Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.

Citations

Journal Article

DCF: A Dataflow-Based Collaborative Filtering Training Algorithm

Xiangyu Ju, +4 more

- 01 Aug 2018 -

International Journal of Parallel Progra...

TL;DR: A Dataflow-based Collaborative Filtering (DCF) algorithm that exploits the fine-grain asynchronous feature of the dataflow model to minimize synchronization overhead, leverages a mini-batch technique to reduce computation and communication complexity, and uses dummy-edge and multicasting techniques to avoid the fine-grain overhead of dependency checking and to reduce data movement.

Compilation and Code Optimization for Data Analytics

Amir Shaikhha

TL;DR: The vision of abstraction without regret argues that it is possible to use high-level languages for building performance-critical systems that allow for both productivity and high performance, instead of trading off the former for the latter.

Journal Article

BALS: Blocked Alternating Least Squares for Parallel Sparse Matrix Factorization on GPUs

Jing Chen, +3 more

- 01 Sep 2021 -

IEEE Transactions on Parallel and Distri...

TL;DR: This paper presents BALS, an efficient implementation of the alternating least squares (ALS) algorithm for parallel matrix factorization, built on top of a new sparse matrix format.

Posted Content

Scalable Manifold Learning for Big Data with Apache Spark

Frank Schoeneman, +1 more

- 31 Aug 2018 -

arXiv: Distributed, Parallel, and Cluste...

TL;DR: In this article, the authors propose a distributed-memory framework that implements end-to-end exact Isomap under the Apache Spark model, without the need to provision data in secondary storage.

Journal Article

Performance analysis of disease diagnostic system using IoMT and real‐time data analytics

Ali Çalhan, +1 more

- 10 Mar 2022 -

Concurrency and Computation: Practice an...

TL;DR: The performance of MLlib algorithms is examined in a real-time model developed for heart disease and diabetes; the SVM algorithm, with an accuracy of 93.33% for heart disease, and the LR algorithm, with 78.89% for diabetes, are found to perform best.

References

Journal Article

Latent Dirichlet Allocation

David M. Blei, +2 more

- 01 Mar 2003 -

Journal of Machine Learning Research

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.

Proceedings Article

Latent Dirichlet Allocation

David M. Blei, +2 more

TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models, including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).

Journal Article

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
