scispace - formally typeset
Topic

Programming with Big Data in R

About: Programming with Big Data in R is a research topic. Over its lifetime, 115 publications have been published within this topic, receiving 38,880 citations. The topic is also known as: pbdR.


Papers
Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
06 Dec 2004
TL;DR: This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets; the implementation runs on large clusters of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

20,309 citations
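The map/reduce model in the abstract above can be sketched in miniature. This is a single-process simulation for illustration only, not the paper's distributed runtime; the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical:

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: emit (word, 1) for each word in a document (the value)."""
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    """Reduce: merge all intermediate values for one key."""
    return key, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle phase: group intermediate values by intermediate key,
    # standing in for the partitioning/scheduling the real runtime does.
    groups = defaultdict(list)
    for k, v in inputs:
        for ik, iv in map_fn(k, v):
            groups[ik].append(iv)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

docs = [("d1", "big data in R"), ("d2", "big data big compute")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts["big"])  # 3
```

The user supplies only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` is what the paper's runtime system handles transparently (partitioning, scheduling, fault tolerance, communication).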

Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations

Book
14 Jun 2008
TL;DR: A book on programming with R, covering the basics, methods and generic functions, and interfaces to C and Fortran.
Abstract: Introduction: Principles and Concepts; Using R; Programming with R: The Basics; R Packages; Objects; Basic Data and Computations; Data Visualization and Graphics; Computing with Text; New Classes; Methods and Generic Functions; Interfaces I: Using C and Fortran; Interfaces II: Between R and Other Systems; How R Works; Errata and Notes for "Software for Data Analysis: Programming with R".

307 citations

Journal ArticleDOI
TL;DR: This work presents a framework for the R statistical computing language that provides a simple yet powerful programming interface to a computational cluster of CPUs that allows the rapid development of R functions that distribute independent computations across the nodes of the computational cluster.
Abstract: Theoretically, many modern statistical procedures are trivial to parallelize. However, practical deployment of a parallelized implementation which is robust and reliably runs on different computational cluster configurations and environments is far from trivial. We present a framework for the R statistical computing language that provides a simple yet powerful programming interface to a computational cluster of CPUs. This interface allows the rapid development of R functions that distribute independent computations across the nodes of the computational cluster. The approach can be extended to finer grain parallelization if needed. The resulting framework allows statisticians to obtain significant speed-ups for some computations at little additional development cost. The particular implementation can be deployed in ad-hoc heterogeneous computing environments.

111 citations
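The interface style the abstract describes (distribute independent computations across nodes, gather the results) can be sketched with a local worker pool. This is an illustration only: the real framework targets R on heterogeneous CPU clusters, whereas this uses Python's standard-library thread pool, and the `simulate` function and task list are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(seed):
    """One independent unit of work: a tiny deterministic 'statistic'."""
    return (seed * seed) % 97

tasks = list(range(8))
with ThreadPoolExecutor(max_workers=4) as pool:
    # Scatter independent tasks to workers, gather results in order.
    results = list(pool.map(simulate, tasks))
print(results)
```

Because each task is independent, the same `pool.map` call is trivially correct regardless of how many workers exist or how tasks are assigned, which is the property the framework exploits.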

01 Jan 2008
TL;DR: This talk introduces R, a language and environment for statistical computing and graphics that provides a wide variety of statistical and graphical techniques and is highly extensible.
Abstract: The talk will introduce R, a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and non-linear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. R provides an Open Source route for research in statistical methodology. (C) R Foundation, from http://www.r-project.org.

108 citations


Network Information
Related Topics (5)
Server
79.5K papers, 1.4M citations
78% related
Cloud computing
156.4K papers, 1.9M citations
72% related
Network packet
159.7K papers, 2.2M citations
72% related
Object (computer science)
106K papers, 1.3M citations
72% related
Scheduling (computing)
78.6K papers, 1.3M citations
72% related
Performance
Metrics
No. of papers in the topic in previous years
Year	Papers
2018	1
2017	10
2016	18
2015	24
2014	28
2013	11