Open Access Journal Article

MLlib: machine learning in apache spark

TLDR
MLlib is an open-source distributed machine learning library for Apache Spark. It provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.
Abstract
Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.
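The pipeline API mentioned in the abstract chains stages that each learn from and then transform the data. A minimal pure-Python sketch of that Transformer/Estimator pattern is shown below; the class names here are illustrative stand-ins, not MLlib's actual classes, and the real library runs each stage over distributed Spark datasets rather than in-memory lists.

```python
# Minimal pure-Python sketch of the pipeline pattern: each stage is fit
# on the data, then transforms it, and stages are chained end to end.
# Illustrative only -- not MLlib's actual API.

class Scaler:
    """Estimator-style stage: learns the max value, scales into [0, 1]."""
    def fit(self, data):
        self.max_ = max(data)
        return self

    def transform(self, data):
        return [x / self.max_ for x in data]


class Threshold:
    """Transformer-style stage: labels values at or above a cutoff as 1."""
    def __init__(self, cutoff):
        self.cutoff = cutoff

    def fit(self, data):
        return self  # nothing to learn

    def transform(self, data):
        return [1 if x >= self.cutoff else 0 for x in data]


class Pipeline:
    """Chains stages: each stage is fit on, then transforms, the data."""
    def __init__(self, stages):
        self.stages = stages

    def fit_transform(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data


pipe = Pipeline([Scaler(), Threshold(0.5)])
print(pipe.fit_transform([1.0, 2.0, 4.0]))  # [0, 1, 1]
```

The value of this design, in MLlib as elsewhere, is that an entire preprocessing-plus-model workflow becomes a single object that can be fit, applied, and tuned as one unit.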



Citations
Journal Article (DOI)

Applications of Neural Networks in Biomedical Data Analysis

TL;DR: This review discusses the latest networks and how they work, with a focus on the analysis of biomedical data, particularly biomarkers in bioimage data, and presents a data analysis of publications about neural networks to provide a quantitative insight into the use of network types and the number of journals per year.

System-Aware Optimization for Machine Learning at Scale

TL;DR: A general optimization framework for machine learning, CoCoA, is proposed that gives careful consideration to systems parameters, often incorporating them directly into the method and theory, and can achieve orders-of-magnitude speedups for solving modern machine learning problems at scale.

Sharing without Showing: Building Secure Collaborative Systems

Wenting Zheng
TL;DR: This dissertation presents four systems that utilize hardware enclaves as well as advanced cryptographic techniques for secure computation on workloads ranging from SQL analytics to machine learning, achieving orders-of-magnitude speedups over prior work and over more straightforward ways of integrating cryptography into systems.
Book Chapter (DOI)

Challenges in Storing and Processing Big Data Using Hadoop and Spark

TL;DR: This chapter discusses the need for new analytical platforms, in particular big data frameworks like Hadoop and Spark, along with MapReduce programming concepts, and describes steps to develop programs in Spark Streaming, Spark SQL, and GraphX.

Tool support for architectural decision making in large software intensive projects

TL;DR: A tool (ADeX) is presented to manage architectural design decisions (ADDs) and support the decision-making process; it can answer questions about which decisions have been made and which elements and quality attributes a decision affects.
References
Journal Article

Scikit-learn: Machine Learning in Python

TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
Journal Article (DOI)

Latent Dirichlet Allocation

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Proceedings Article

Latent Dirichlet Allocation

TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models, including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Journal Article (DOI)

MapReduce: simplified data processing on large clusters

TL;DR: This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets, which runs on large clusters of commodity machines and is highly scalable.