MLlib: Machine Learning in Apache Spark

Open Access Journal Article

Xiangrui Meng, +15 more

- 01 Jan 2016 -

Journal of Machine Learning Research

- Vol. 17, Iss: 1, pp 1235-1241

TLDR

MLlib is an open-source distributed machine learning library for Apache Spark that provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.

Abstract:

Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.
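To illustrate why iterative learning tasks fit a data-parallel platform, here is a minimal pure-Python sketch of gradient descent in which each iteration maps over data partitions and reduces the partial results. It is not MLlib or Spark code: the function names `partial_gradient` and `fit`, the toy data, and the step size are all assumptions made for illustration.

```python
from functools import reduce


def partial_gradient(partition, w):
    """Map step: squared-error gradient over one partition for the model y ~ w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in partition)


def fit(partitions, steps=200, lr=0.01):
    """Gradient descent that re-reads every partition on each iteration."""
    n = sum(len(p) for p in partitions)
    w = 0.0
    for _ in range(steps):
        # Map: one partial gradient per partition. Reduce: sum them.
        grads = map(lambda p: partial_gradient(p, w), partitions)
        total = reduce(lambda a, b: a + b, grads)
        w -= lr * total / n
    return w


# Hypothetical toy data drawn from y = 3x, split into two "partitions".
partitions = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = fit(partitions)
```

Every iteration touches the same partitions again, which is the access pattern that benefits from keeping the data in memory across passes rather than reloading it each time.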

Citations

Proceedings Article

Constructing dynamic ontologies from biomedical publications

Megha Nagabhushan, +3 more

TL;DR: A semantic framework is proposed that aims to automatically generate an ontology by extracting assertions and topics from multiple free-text scientific publications in PubMed, and it is shown that the generated ontology may be very effective in biomedical applications with scientific publications.

Efficient Matrix Multiplication in Hadoop.

Song Deng, +1 more

TL;DR: The method observably improves the performance of dense matrix multiplication in MapReduce and proposes a user feedback method to avoid the overheads of starting multiple map waves.

Proceedings Article

Optimization for Classical Machine Learning Problems on the GPU

Sören Laue, +2 more

TL;DR: The GENO framework is extended, allowing the user to specify constrained optimization problems in an easy-to-read modeling language and a solver is automatically generated from this specification, which outperforms state-of-the-art approaches like CVXPY combined with a GPU-accelerated solver such as cuOSQP or SCS by a few orders of magnitude.

Journal Article

Online CQI‐based optimization using k‐means and machine learning approach under sparse system knowledge

Brijesh Shah, +4 more

- 01 Feb 2020 -

International Journal of Communication S...

TL;DR: An optimization framework for SONs based on channel quality indicator (CQI) and loading condition, without detailed knowledge of the network environment, is offered to ensure efficient network operation by electrical tilt‐based radio frequency (RF) performance optimization using a machine learning approach.

Proceedings Article

Bayesian Networks with Structural Restrictions: Parallelization, Performance, and Efficient Cross-Validation

Hao Peng, +2 more

TL;DR: This paper implemented several parallel algorithms, including Naive Bayes, Tree-Augmented Naive Bayes, k-BAN, and k-BAN with Order Swapping, as well as an efficient way to perform cross-validation, resulting in significant speedups.

References

Journal Article

Latent dirichlet allocation

David M. Blei, +2 more

- 01 Mar 2003 -

Journal of Machine Learning Research

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.

Proceedings Article

Latent Dirichlet Allocation

David M. Blei, +2 more

TL;DR: This paper proposed a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).

Journal Article

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
