Open Access Journal Article

MLlib: Machine Learning in Apache Spark

TLDR
MLlib is an open-source distributed machine learning library for Apache Spark that provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.
Abstract
Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.
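The abstract emphasizes that Spark suits iterative machine learning. As a hedged, stdlib-only illustration (pure Python, not MLlib's actual API), the sketch below shows the shape of such a workload: batch gradient descent that rescans the same dataset every iteration, which is exactly the access pattern Spark's in-memory caching accelerates and that MLlib's optimization primitives implement at scale.

```python
# Batch gradient descent for 1-D least squares: an iterative algorithm
# that makes a full pass over the same dataset each iteration, which is
# why keeping the data cached in memory (as Spark does) pays off.

def gradient_descent(data, lr=0.1, iters=100):
    """data: list of (x, y) pairs; fits y ~ w * x by least squares."""
    w = 0.0
    n = len(data)
    for _ in range(iters):
        # One full scan of the (cached) dataset per iteration.
        grad = sum(2 * (w * x - y) * x for x, y in data) / n
        w -= lr * grad
    return w

points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = gradient_descent(points)
print(round(w, 3))  # converges toward 2.0
```

In MLlib the same loop runs over a distributed dataset, with the gradient sum computed as a parallel reduction across workers.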



Citations
Posted Content

Real-time Text Analytics Pipeline Using Open-source Big Data Tools.

TL;DR: This paper proposes and evaluates a real-time text-processing pipeline built from open-source big-data tools: Apache Kafka for data ingestion, Apache Spark for in-memory processing, Apache Cassandra for storing the processed results, and the D3 JavaScript library for visualization. The pipeline processes tweets with under a minute of latency.
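As a rough, stdlib-only illustration of the staged architecture this TL;DR describes (not the paper's code), the sketch below uses stand-ins for each component: a queue for ingestion (Kafka's role), a worker thread for processing (Spark's role), and a dict for storage (Cassandra's role).

```python
import queue
import threading

# Illustrative stand-ins for the pipeline stages: ingestion queue,
# processing worker, and result store. All names here are hypothetical.
ingest_q = queue.Queue()
store = {}

def process_stage():
    """Consume raw texts, count words, write results to the store."""
    while True:
        text = ingest_q.get()
        if text is None:          # sentinel: shut the worker down
            break
        for word in text.lower().split():
            store[word] = store.get(word, 0) + 1

worker = threading.Thread(target=process_stage)
worker.start()
for tweet in ["Spark streams fast", "Kafka feeds Spark"]:
    ingest_q.put(tweet)
ingest_q.put(None)
worker.join()
print(store["spark"])  # 2
```

The real system replaces each stand-in with a distributed component, but the ingest → process → store flow is the same.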
Book Chapter

Scalable Distributed Genetic Algorithm Using Apache Spark (S-GA)

TL;DR: This paper presents a Scalable Genetic Algorithm using Apache Spark (S-GA), which is found to be more scalable than prior approaches and can scale up to large-dimensional optimization problems while yielding comparable results.
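S-GA's internals are not given in this summary; as a hedged sketch of the underlying technique, the serial genetic algorithm below solves the toy OneMax problem (maximize the number of 1-bits). In a Spark-based GA the population or fitness evaluation would be partitioned across workers; this single-process version only shows the evolutionary loop.

```python
import random

# Minimal serial GA for OneMax. Parameter values are illustrative.
random.seed(0)
BITS, POP, GENS = 20, 30, 60

def fitness(ind):
    return sum(ind)

def mutate(ind, rate=0.05):
    # Flip each bit independently with probability `rate`.
    return [b ^ (random.random() < rate) for b in ind]

def crossover(a, b):
    cut = random.randrange(1, BITS)      # single-point crossover
    return a[:cut] + b[cut:]

pop = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[: POP // 2]            # truncation selection (elitist)
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(POP - len(parents))]
    pop = parents + children

best = max(pop, key=fitness)
print(fitness(best))  # typically reaches the optimum of 20
```

Because fitness evaluation is independent per individual, it is the natural step to distribute with Spark's `map` over a partitioned population.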
Book Chapter

Learning Models over Relational Data: A Brief Tutorial.

TL;DR: This tutorial surveys the state of the art in learning models over relational databases and makes the case for a first-principles approach that exploits recent developments in database research.
Proceedings Article

Data integration in scalable data analytics platform for process industries

TL;DR: The aim was to design and develop a cross-sectorial, scalable environment that enables data collection from different sources and supports the development of predictive functions, helping process industries optimize their production processes.
Proceedings Article

Ecosystem: A Zombie Category?

TL;DR: Ulrich Beck's theory of zombie categories is used as an instrument to ask whether the general concept of an 'ecosystem' has itself become a zombie category: a term that is already dead and empty of content while still being alive and in use.
References
Journal Article

Scikit-learn: Machine Learning in Python

TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
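A minimal example of the high-level estimator API this TL;DR credits to scikit-learn: `fit`/`predict` with sensible defaults, composable into pipelines. The toy data and model choice are mine, assuming scikit-learn is installed.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tiny, linearly separable toy dataset (illustrative only).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

# Preprocessing and model chained into a single estimator object.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X, y)
print(list(clf.predict([[0.5], [2.5]])))  # expected: [0, 1]
```

MLlib's pipeline API (mentioned in the abstract above) follows a similar compose-then-fit design at distributed scale.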
Journal Article

Latent Dirichlet Allocation

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Proceedings Article

Latent Dirichlet Allocation

TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
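Both LDA entries above describe the same generative model: per-document topic proportions are drawn from a Dirichlet, then each word is generated by picking a topic and then a word from that topic. A stdlib-only sketch of that generative story for one document, with hand-picked (purely illustrative) topic-word distributions:

```python
import random

random.seed(1)

def dirichlet(alpha):
    """Sample from a Dirichlet via normalized Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def pick(weights, items):
    """Draw one item according to a list of probabilities."""
    r, acc = random.random(), 0.0
    for w, it in zip(weights, items):
        acc += w
        if r < acc:
            return it
    return items[-1]

# Two topics over a three-word vocabulary (illustrative values; in the
# paper the topic-word distributions are themselves Dirichlet-distributed).
topics = {
    0: ([0.7, 0.2, 0.1], ["sport", "game", "data"]),
    1: ([0.1, 0.2, 0.7], ["sport", "game", "data"]),
}

theta = dirichlet([0.5, 0.5])       # per-document topic proportions
doc = []
for _ in range(8):                  # each word: pick a topic, then a word
    z = pick(theta, [0, 1])
    probs, vocab = topics[z]
    doc.append(pick(probs, vocab))
print(doc)
```

Inference in LDA runs this story in reverse, recovering the hidden topic structure from observed documents.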
Journal Article

MapReduce: simplified data processing on large clusters

TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
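The map/shuffle/reduce structure this TL;DR describes can be shown in miniature. The single-process word-count sketch below keeps MapReduce's structure without the cluster machinery (function names are mine, not Google's API): a user-supplied map function emits (key, value) pairs, the framework groups them by key, and a reduce function folds each group.

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):
    # Map phase: emit (word, 1) for every word in the input record.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce phase: fold all values for one key into a single result.
    return (key, sum(values))

def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)                     # shuffle: group by key
    for key, value in chain.from_iterable(map(map_fn, inputs)):
        groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

lines = ["the quick fox", "the lazy dog", "the fox"]
counts = map_reduce(lines, map_fn, reduce_fn)
print(counts["the"], counts["fox"])  # 3 2
```

In the real system the map and reduce phases run in parallel on a cluster of commodity machines, with the shuffle performed over the network; the user-visible contract is the same pair of functions.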