MLlib: Machine Learning in Apache Spark

Open Access Journal Article

Xiangrui Meng, +15 more

- 01 Jan 2016 -

Journal of Machine Learning Research

- Vol. 17, Iss: 1, pp 1235-1241

TLDR

MLlib, presented in this paper, is an open-source distributed machine learning library for Apache Spark that provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.

Abstract:

Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.

Citations

Journal Article

Graphical Flow-based Spark Programming

Tanmaya Mahapatra, +1 more

- 01 Dec 2020 -

Journal of Big Data

TL;DR: Use cases for Spark have been prototyped and evaluated to demonstrate code abstraction, automatic interconversion of data abstractions, and automatic generation of target Spark programs, which are key to lowering the complexity, and the ensuing learning curve, of developing Big Data applications.

A text mining framework for Big Data

Niki Pavlopoulou, +3 more

TL;DR: This work focuses on a proprietary, complete, near real-time automated classification framework for unstructured data that uses Natural Language Processing and Machine Learning algorithms on Apache Spark and achieves accuracy comparable to some of the best approaches presented in the literature.

Journal Article

KATZ centrality with biogeography-based optimization for influence maximization problem

Abbas Salehi, +1 more

- 01 Jul 2020 -

Journal of Combinatorial Optimization

TL;DR: The objective was to use an enhanced meta-heuristic algorithm with measuring centrality to solve the IM problem and it is well known that the proposed algorithm is more efficient, accurate, and faster than influence maximization greedy approaches.

Dissertation

From Detection to Discourse: Tracking Events and Communities in Breaking News

Igor Brigadir

TL;DR: This thesis presents techniques developed to support journalists who monitor social media for breaking news, starting with the curation of newsworthy sources, through to implementing an alert system for breaking news events, tracking the evolution of these stories over time, and finally exploring the language used by different communities to gain insights into the discourse around an event.

Proceedings Article

Online car-hailing dispatch: Deep supply-demand gap forecast on spark

Ji Li, +1 more

TL;DR: The pilot investigation reports that applying a deep learning model to forecast the supply-demand gap in real time with data from the internet is highly promising, and that distributed in-memory computation is very effective at improving the training speed and modeling capability of LSTM neural networks in this study.

References

Journal Article

Latent Dirichlet Allocation

David M. Blei, +2 more

- 01 Mar 2003 -

Journal of Machine Learning Research

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.

Proceedings Article

Latent Dirichlet Allocation

David M. Blei, +2 more

TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).

Journal Article

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
