Open Access Journal Article

MLlib: machine learning in apache spark

TLDR
MLlib is an open-source distributed machine learning library for Apache Spark that provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.
Abstract
Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.
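To give a feel for the estimator/transformer pipeline pattern the abstract refers to, here is a minimal plain-Python sketch of that chaining style. This is not MLlib's actual API (which ships with Spark's Scala, Java, and Python bindings); every class and parameter name below is hypothetical and chosen only to illustrate how staged fit/transform pipelines compose.

```python
# Hypothetical sketch of the estimator/transformer pipeline pattern
# (plain Python, no Spark required; names are illustrative only).

class Scaler:
    """Learns the maximum feature value, then scales features into [0, 1]."""
    def fit(self, data):
        self.max_ = max(max(row) for row in data)
        return self

    def transform(self, data):
        return [[x / self.max_ for x in row] for row in data]

class ThresholdClassifier:
    """Predicts 1 when the mean scaled feature value exceeds a threshold."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, data):
        return self

    def transform(self, data):
        return [1 if sum(row) / len(row) > self.threshold else 0 for row in data]

class Pipeline:
    """Chains stages: each stage is fit on the output of the previous one."""
    def __init__(self, stages):
        self.stages = stages

    def fit_transform(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data

pipe = Pipeline([Scaler(), ThresholdClassifier(threshold=0.5)])
print(pipe.fit_transform([[2.0, 8.0], [1.0, 2.0]]))  # → [1, 0]
```

The design point this illustrates is the one the abstract highlights: because every stage exposes the same fit/transform contract, end-to-end workflows (feature scaling through model prediction) compose without glue code.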



Citations
Proceedings ArticleDOI

Cluster-size optimization within a cloud-based ETL framework for Big Data

TL;DR: Presents a cloud-based ETL framework in which a general cluster-size optimization algorithm is used, provides implementation details, and shows that the framework can perform the required job within a predefined, and thus known, time.
Posted Content

A Layered Aggregate Engine for Analytics Workloads

TL;DR: LMFAO as discussed by the authors is an in-memory optimization and execution engine for batches of aggregates over the input database, which consists of several layers of logical and code optimizations that systematically exploit sharing of computation, parallelism, and code specialization.
Journal ArticleDOI

Leveraging resource management for efficient performance of Apache Spark

TL;DR: This study uses Spark's machine learning library to implement different machine learning algorithms, manages resources (CPU, memory, and disk) to assess the performance of Apache Spark, and investigates the tuning of resource allocation in Spark.
Journal ArticleDOI

A utilization model for optimization of checkpoint intervals in distributed stream processing systems

TL;DR: A rigorous derivation of utilization that incorporates checkpoint interval, failure rate, checkpoint cost, failure detection and restart cost, depth of the system topology, and message delay yields an elegant expression for utilization and provides an optimal checkpoint interval given these parameters.
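The cited paper derives its own utilization expression for stream processing systems; for intuition only, here is the classic first-order checkpoint model (Young's approximation), which trades checkpoint overhead against expected rework after a failure. All parameter values are made up for illustration, and this is not the paper's model.

```python
import math

def young_optimal_interval(checkpoint_cost, mtbf):
    """Classic Young approximation: T_opt = sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

def utilization(interval, checkpoint_cost, mtbf):
    """First-order utilization: 1 minus checkpoint overhead (C / T)
    minus expected rework per failure (about T / 2 lost per failure,
    i.e. T / (2 * MTBF))."""
    return 1.0 - checkpoint_cost / interval - interval / (2.0 * mtbf)

# Example: 5 s checkpoints, one failure per hour on average.
t_opt = young_optimal_interval(checkpoint_cost=5.0, mtbf=3600.0)
print(round(t_opt, 1))  # → 189.7 (seconds)
```

Checkpointing too often or too rarely both lower utilization; the optimum balances the two overhead terms, which is the trade-off the cited derivation generalizes with detection cost, topology depth, and message delay.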
Journal ArticleDOI

Big Data: Controlling Fraud by Using Machine Learning Libraries on Spark

TL;DR: This article uses the k-Means method from the machine learning libraries on Spark to determine whether incoming network values represent normal behavior, and detects 10 abnormal behaviors among 400 thousand network records with the k-Means method.
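The idea behind k-Means-based fraud detection in entries like the one above can be sketched in a few lines: cluster known-normal traffic, then flag incoming points that lie far from every centroid. This is a plain-Python toy, not the article's Spark MLlib code; the data, seeding, and threshold are all invented for illustration.

```python
import math

# Toy k-means anomaly detection: fit centroids on normal traffic,
# then flag points far from the nearest centroid. Illustrative only.

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(cluster):
    return [sum(xs) / len(cluster) for xs in zip(*cluster)]

def kmeans(points, k, iters=20):
    """Lloyd's algorithm, deterministically seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def is_anomaly(point, centroids, threshold):
    """A point is anomalous if its nearest centroid is beyond the threshold."""
    return min(dist(point, c) for c in centroids) > threshold

normal_traffic = [[1, 1], [1.2, 0.8], [0.9, 1.1], [5, 5], [4.8, 5.2], [5.1, 4.9]]
centroids = kmeans(normal_traffic, k=2)
flags = [is_anomaly(p, centroids, threshold=2.0) for p in [[5.0, 5.1], [40, 2]]]
print(flags)  # → [False, True]
```

In a distributed setting the same scoring step is what MLlib parallelizes across a cluster, which is what makes the approach viable at the scale of hundreds of thousands of records.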
References
Journal Article

Scikit-learn: Machine Learning in Python

TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
Journal ArticleDOI

Latent dirichlet allocation

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Proceedings Article

Latent Dirichlet Allocation

TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Journal ArticleDOI

MapReduce: simplified data processing on large clusters

TL;DR: This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets, which runs on a large cluster of commodity machines and is highly scalable.
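The MapReduce programming model summarized above reduces to two user-supplied functions plus a shuffle between them. A minimal single-process sketch of the model, using the canonical word-count example (this is plain Python standing in for the distributed implementation; function names are illustrative):

```python
from collections import defaultdict
from itertools import chain

# Word count in the MapReduce style: map emits (word, 1) pairs,
# shuffle groups pairs by key, reduce sums each group.

def map_phase(document):
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["spark makes iterative jobs fast", "spark runs on clusters"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = reduce_phase(shuffle(pairs))
print(counts["spark"])  # → 2
```

Because map and reduce are side-effect-free over independent keys, the runtime can partition both phases across machines; Spark's relevance to the MLlib paper is precisely that it generalizes this model to the iterative workloads machine learning needs.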