MLlib: machine learning in Apache Spark

Open Access Journal Article

Xiangrui Meng, +15 more

- 01 Jan 2016 -

Journal of Machine Learning Research

- Vol. 17, Iss: 1, pp 1235-1241

TLDR

MLlib is an open-source distributed machine learning library for Apache Spark that provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.

Abstract:

Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has grown rapidly thanks to its vibrant open-source community of over 140 contributors, and it includes extensive documentation to support further growth and to let users quickly get up to speed.

Citations

Journal Article

IoT data analytics architecture for smart healthcare using RFID and WSN

Nur Banu Oğur, +2 more

- 11 Jan 2022 -

ETRI Journal

TL;DR: Introduces a new real-time data analytics architecture for an IoT-based smart healthcare system that combines a wireless sensor network with radio-frequency identification technology in a vertical domain and can handle large volumes of data.

Posted Content

Vamsa: Tracking Provenance in Data Science Scripts.

Mohammad Hossein Namaki, +5 more

TL;DR: Presents Vamsa, a modular system that extracts provenance from Python scripts without requiring any changes to the user's code, and evaluates it on up to 450K real-world data science scripts from Kaggle and publicly available Python notebooks.

Journal Article

Top Data Mining Tools for the Healthcare Industry

Judith Santos-Pereira, +3 more

- 08 Jun 2021 -

Journal of King Saud University - Comput...

TL;DR: A survey of popular open-source data mining tools that proposes tool selection criteria based on healthcare application requirements and identifies the best tools according to those criteria.

Book Chapter

Actionable Pattern Discovery for Tweet Emotions

Angelina A. Tzacheva, +2 more

TL;DR: This work extracts Action Rules with respect to the Emotion class from user tweets, discovering actionable recommendations that suggest ways to shift the user's emotion to a better or more positive state.

Book Chapter

ADABench - Towards an Industry Standard Benchmark for Advanced Analytics

Tilmann Rabl, +9 more

TL;DR: The digital revolution, rapidly decreasing storage cost, and remarkable results achieved by state of the art machine learning methods are driving widespread adoption of ML approaches.

References

Journal Article

Latent Dirichlet allocation

David M. Blei, +2 more

- 01 Mar 2003 -

Journal of Machine Learning Research

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.

Proceedings Article

Latent Dirichlet Allocation

David M. Blei, +2 more

TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models, including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).

Journal Article

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
