MLlib: Machine Learning in Apache Spark

Open Access Journal Article

Xiangrui Meng, +15 more

- 01 Jan 2016 -

Journal of Machine Learning Research

- Vol. 17, Iss: 1, pp 1235-1241

TLDR

MLlib is an open-source distributed machine learning library for Apache Spark. It provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.

Abstract:

Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.
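To make the idea of an end-to-end pipeline concrete, the following is a minimal plain-Python sketch of the fit/transform pattern that pipeline-style APIs are built around. It is not MLlib's API and does not use Spark: the class names (`Tokenizer`, `VocabIndexer`, `VocabModel`, `Pipeline`, `PipelineModel`) and the toy data are assumptions chosen only for illustration. Each stage is fitted on the output of the previous stage, and the fitted stages are then replayed in order on new data.

```python
# Toy illustration of a fit/transform pipeline. All names here are
# hypothetical and written for this sketch; this is not Spark or MLlib code.


class Tokenizer:
    """Stateless stage: nothing to learn, so fit returns the stage itself."""

    def fit(self, rows):
        return self

    def transform(self, rows):
        # Lowercase each string and split it on whitespace.
        return [row.lower().split() for row in rows]


class VocabModel:
    """Fitted stage holding a token -> integer id mapping."""

    def __init__(self, index):
        self.index = index

    def transform(self, rows):
        # Tokens not seen during fitting are mapped to -1.
        return [[self.index.get(tok, -1) for tok in row] for row in rows]


class VocabIndexer:
    """Stage with learned state: fit builds a vocabulary from the data."""

    def fit(self, rows):
        vocab = sorted({tok for row in rows for tok in row})
        return VocabModel({tok: i for i, tok in enumerate(vocab)})


class PipelineModel:
    """Sequence of fitted stages applied in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Sequence of unfitted stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows)        # learn state from the current data
            rows = model.transform(rows)   # pass transformed data to the next stage
            fitted.append(model)
        return PipelineModel(fitted)


train = ["Spark is fast", "MLlib ships with Spark"]
model = Pipeline([Tokenizer(), VocabIndexer()]).fit(train)
encoded = model.transform(["Spark ships fast", "Hadoop is fast"])
print(encoded)
```

Separating the unfitted `Pipeline` from the fitted `PipelineModel` keeps learned state (here, the vocabulary) fixed once training is done, so the same preprocessing is applied to training and new data alike.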

Citations

Book Chapter

Delay Prediction System for Large-Scale Railway Networks Based on Big Data Analytics

Luca Oneto, +7 more

TL;DR: This paper proposes a fast learning algorithm for predicting train delays based on the Extreme Learning Machine that fully exploits the recent in-memory large-scale data processing technologies.

Journal Article

Detection and Extraction of Brain Hemorrhage on the CT-Scan Image using Hybrid Thresholding Method

Sumijan Sumijan, +2 more

- 04 Oct 2016 -

Computer Science and Information Technol...

TL;DR: This research determines the area of brain bleeding on each CT-scan image slice for every patient, detecting and extracting the bleeding so that its volume can be calculated.

Proceedings Article

A Machine Learning Approach to Demographic Prediction using Geohashes

Avipsa Roy, +1 more

TL;DR: This paper uses a machine learning approach to predict the gender and age of mobile phone users from a set of 3,252,950 anonymised GPS trajectories with 60,865 unique devices using a predictive model which is based upon the concept of Geohashes.

Proceedings Article

AFrame: Extending DataFrames for Large-Scale Modern Data Analysis

Phanwadee Sinthong, +1 more

TL;DR: The architecture of AFrame is presented, the underlying capabilities of AsterixDB that efficiently support modern data analytic operations are described, and an extensible micro-benchmark is introduced for use in evaluating DataFrame performance in both single-node and distributed settings via a collection of representative analytic operations.

Journal Article

A Model and Survey of Distributed Data-Intensive Systems

Alessandro Margara, +3 more

- 21 Mar 2022 -

ACM Computing Surveys

TL;DR: This research presents a meta-analysis of the immune system’s response to machine learning and its applications in the context of social reinforcement learning and reinforcement learning.

References

Journal Article

Latent Dirichlet Allocation

David M. Blei, +2 more

- 01 Mar 2003 -

Journal of Machine Learning Research

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.

Proceedings Article

Latent Dirichlet Allocation

David M. Blei, +2 more

TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).

Journal Article

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean, +1 more

TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
