Book Chapter

Logistic Regression on Hadoop Using PySpark

16 Dec 2019, pp. 19–26
TL;DR: The purpose of this work was to see how effective Hadoop can be at increasing the efficiency of Machine Learning for a given problem, tested by implementing and training three Logistic Regression models.
Abstract: Training a Machine Learning (ML) model on large datasets is difficult, especially when a high-end hardware configuration is not accessible. Even a relatively good configuration does not always produce quick results; depending on the dataset size, training can take anywhere from seconds to several hours. More often than not, the tasks we are interested in involve big datasets and complex models. The purpose of our work was to see how effective Hadoop can be at increasing the efficiency of Machine Learning for a given problem. Of the many models to choose from, Logistic Regression was chosen because it is relatively simple to implement. Three Logistic Regression models were implemented and trained on the MNIST Handwritten Digits dataset. The first was implemented in Python using NumPy, without any ML libraries. The second used the LogisticRegression class that comes with the Scikit-learn Python package, and the third was done using PySpark MLlib. Towards the end of the paper, we present the observations and results obtained from each implementation.
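To make the three implementations concrete, here is a minimal sketch of the third variant using the DataFrame-based API shipped with PySpark. The file name, the column layout (a label column followed by 784 pixel columns), and the hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: multinomial logistic regression on MNIST with PySpark.
# Assumes mnist_train.csv holds a 'label' column plus 784 pixel columns;
# path, schema, and hyperparameters are illustrative, not from the paper.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mnist-logreg").getOrCreate()

df = spark.read.csv("mnist_train.csv", header=True, inferSchema=True)
pixel_cols = [c for c in df.columns if c != "label"]

# Pack the 784 pixel columns into the single vector column Spark ML expects.
assembler = VectorAssembler(inputCols=pixel_cols, outputCol="features")
train = assembler.transform(df).select("label", "features")

lr = LogisticRegression(labelCol="label", featuresCol="features",
                        maxIter=100, regParam=0.01)
model = lr.fit(train)
print("training accuracy:", model.summary.accuracy)

spark.stop()
```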
References
Book Chapter
01 Jan 2010
TL;DR: A more precise analysis uncovers qualitatively different tradeoffs for the case of small-scale and large-scale learning problems.
Abstract: During the last decade, data sizes have grown faster than the speed of processors. In this context, the capabilities of statistical machine learning methods are limited by computing time rather than by sample size. A more precise analysis uncovers qualitatively different tradeoffs for small-scale and large-scale learning problems. The large-scale case involves the computational complexity of the underlying optimization algorithm in non-trivial ways. Seemingly unsuitable optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems. In particular, second-order stochastic gradient and averaged stochastic gradient are asymptotically efficient after a single pass over the training set.

5,561 citations
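As a worked illustration of the stochastic gradient updates the chapter above analyzes, here is a minimal NumPy sketch of SGD for binary logistic regression; the learning rate, epoch count, and toy data are assumptions chosen for demonstration only.

```python
# Hypothetical sketch of the SGD update for binary logistic regression:
# w <- w - lr * (sigmoid(w.x) - y) * x, one example per update.
# Learning rate, epochs, and data are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logreg(X, y, lr=0.1, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                # a single pass can suffice at scale
        for i in rng.permutation(len(y)):  # visit examples in random order
            grad = (sigmoid(X[i] @ w) - y[i]) * X[i]
            w -= lr * grad
    return w

# Toy usage: two linearly separable clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
w = sgd_logreg(X, y)
print("training accuracy:", ((sigmoid(X @ w) > 0.5) == y).mean())
```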

Proceedings Article
03 May 2010
TL;DR: The architecture of HDFS is described, along with experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.
Abstract: The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.

5,005 citations
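Tying this back to the paper's setting, a minimal sketch of pointing Spark at a dataset stored in HDFS follows; the NameNode host, port, and file path are placeholders, not values from either paper.

```python
# Hypothetical sketch: reading a file stored in HDFS from PySpark.
# The namenode host/port and path are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-read").getOrCreate()

# An HDFS URI names the NameNode; block data is streamed from the
# DataNodes that host it, so reads scale with the cluster.
df = spark.read.csv("hdfs://namenode:9000/data/mnist_train.csv",
                    header=True, inferSchema=True)
print(df.count(), "rows read from HDFS")

spark.stop()
```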

Proceedings Article
25 Apr 2012
TL;DR: Resilient Distributed Datasets (RDDs) are presented: a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.
Abstract: We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.

4,151 citations
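A minimal PySpark sketch of the RDD abstraction described above, assuming a local cluster: coarse-grained transformations build a lineage graph lazily, persist() keeps the materialized partitions in memory across actions, and lost partitions can be recomputed from lineage rather than restored from replicas. The data and operations are illustrative assumptions.

```python
# Hypothetical sketch of the RDD programming model: map/filter are
# coarse-grained transformations recorded as lineage; persist() caches
# the materialized result in memory for reuse by later actions.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

nums = sc.parallelize(range(1_000_000))
squares = nums.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
squares.persist()                  # keep in memory across the two actions

print("count:", squares.count())   # first action materializes the RDD
print("sum:  ", squares.sum())     # second action reuses the cached data

sc.stop()
```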

Journal Article
TL;DR: It is argued that a simple "one-vs-all" scheme is as accurate as any other approach, assuming that the underlying binary classifiers are well-tuned regularized classifiers such as support vector machines.
Abstract: We consider the problem of multiclass classification. Our main thesis is that a simple "one-vs-all" scheme is as accurate as any other approach, assuming that the underlying binary classifiers are well-tuned regularized classifiers such as support vector machines. This thesis is interesting in that it disagrees with a large body of recent published work on multiclass classification. We support our position by means of a critical review of the existing literature, a substantial collection of carefully controlled experimental work, and theoretical arguments.

1,841 citations
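A minimal sketch of the one-vs-all scheme argued for above, substituting scikit-learn's LogisticRegression for the support vector machines discussed in the paper; the dataset and max_iter value are assumptions for demonstration.

```python
# Hypothetical sketch of one-vs-all multiclass classification: train one
# binary "is it class k?" model per class, then predict the class whose
# classifier scores highest. Dataset and settings are illustrative.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
classes = np.unique(y)

# One regularized binary classifier per class.
models = [LogisticRegression(max_iter=1000).fit(X, (y == k).astype(int))
          for k in classes]

# Score every class for every sample; pick the argmax.
scores = np.column_stack([m.decision_function(X) for m in models])
pred = classes[np.argmax(scores, axis=1)]
print("training accuracy:", (pred == y).mean())
```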

Journal Article
TL;DR: The differences and similarities of logistic regression and artificial neural network models are summarized from a technical point of view and compared with other machine learning algorithms, using a set of quality criteria.

1,681 citations
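As a rough illustration of the kind of side-by-side comparison such a review covers, here is a small scikit-learn sketch fitting logistic regression and a small artificial neural network on the same data; the dataset, model sizes, and iteration counts are arbitrary assumptions.

```python
# Hypothetical sketch: the same split evaluated with logistic regression
# and a small neural network. All sizes/settings are illustrative.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                    random_state=0).fit(X_tr, y_tr)

print("logistic regression:", logreg.score(X_te, y_te))
print("neural network:     ", mlp.score(X_te, y_te))
```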